[beam-site] 01/04: [BEAM-667] Verify and update wordcount snippets

mergebot-role Thu, 21 Sep 2017 14:01:40 -0700

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git


commit b288b69ecc9106728bf8f39fa7a27bd82e50f5c6
Author: melissa <[email protected]>
AuthorDate: Wed Aug 30 16:28:42 2017 -0700

    [BEAM-667] Verify and update wordcount snippets
---
 src/get-started/wordcount-example.md | 61 ++++++++++++++++++------------------
 1 file changed, 31 insertions(+), 30 deletions(-)

diff --git a/src/get-started/wordcount-example.md 
b/src/get-started/wordcount-example.md
index cf0ebc0..371e6bb 100644
--- a/src/get-started/wordcount-example.md
+++ b/src/get-started/wordcount-example.md
@@ -11,7 +11,7 @@ redirect_from: /use/wordcount-example/
 {:toc}
 
 <nav class="language-switcher">
-  <strong>Adapt for:</strong> 
+  <strong>Adapt for:</strong>
   <ul>
     <li data-type="language-java">Java SDK</li>
     <li data-type="language-py">Python SDK</li>
@@ -125,13 +125,13 @@ To view the full code in Python, see 
**[wordcount_minimal.py](https://github.com
 * Writing output (in this example: writing to a text file)
 * Running the Pipeline
 
-The following sections explain these concepts in detail along with excerpts of 
the relevant code from the Minimal WordCount pipeline.
+The following sections explain these concepts in detail, using the relevant 
code excerpts from the Minimal WordCount pipeline.
 
 ### Creating the Pipeline
 
-The first step in creating a Beam pipeline is to create a `PipelineOptions` 
object. This object lets us set various options for our pipeline, such as the 
pipeline runner that will execute our pipeline and any runner-specific 
configuration required by the chosen runner. In this example we set these 
options programmatically, but more often command-line arguments are used to set 
`PipelineOptions`. 
+The first step in creating a Beam pipeline is to create a `PipelineOptions` 
object. This object lets us set various options for our pipeline, such as the 
pipeline runner that will execute our pipeline and any runner-specific 
configuration required by the chosen runner. In this example we set these 
options programmatically, but more often, command-line arguments are used to 
set `PipelineOptions`.
 
-You can specify a runner for executing your pipeline, such as the 
`DataflowRunner` or `SparkRunner`. If you omit specifying a runner, as in this 
example, your pipeline will be executed locally using the `DirectRunner`. In 
the next sections, we will specify the pipeline's runner.
+You can specify a runner for executing your pipeline, such as the 
`DataflowRunner` or `SparkRunner`. If you omit specifying a runner, as in this 
example, your pipeline executes locally using the `DirectRunner`. In the next 
sections, we will specify the pipeline's runner.
 
 ```java
  PipelineOptions options = PipelineOptionsFactory.create();
@@ -154,7 +154,7 @@ You can specify a runner for executing your pipeline, such 
as the `DataflowRunne
 {% github_sample 
/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py 
tag:examples_wordcount_minimal_options
 %}```
 
-The next step is to create a Pipeline object with the options we've just 
constructed. The Pipeline object builds up the graph of transformations to be 
executed, associated with that particular pipeline.
+The next step is to create a `Pipeline` object with the options we've just 
constructed. The Pipeline object builds up the graph of transformations to be 
executed, associated with that particular pipeline.
 
 ```java
 Pipeline p = Pipeline.create(options);
@@ -175,7 +175,7 @@ Figure 1: The pipeline data flow.
 
 The Minimal WordCount pipeline contains five transforms:
 
-1.  A text file `Read` transform is applied to the Pipeline object itself, and 
produces a `PCollection` as output. Each element in the output PCollection 
represents one line of text from the input file. This example uses input data 
stored in a publicly accessible Google Cloud Storage bucket ("gs://").
+1.  A text file `Read` transform is applied to the `Pipeline` object itself, 
and produces a `PCollection` as output. Each element in the output 
`PCollection` represents one line of text from the input file. This example 
uses input data stored in a publicly accessible Google Cloud Storage bucket 
("gs://").
 
     ```java
     p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
@@ -209,7 +209,7 @@ The Minimal WordCount pipeline contains five transforms:
 
 3.  The SDK-provided `Count` transform is a generic transform that takes a 
`PCollection` of any type, and returns a `PCollection` of key/value pairs. Each 
key represents a unique element from the input collection, and each value 
represents the number of times that key appeared in the input collection.
 
-       In this pipeline, the input for `Count` is the `PCollection` of 
individual words generated by the previous `ParDo`, and the output is a 
`PCollection` of key/value pairs where each key represents a unique word in the 
text and the associated value is the occurrence count for each.
+    In this pipeline, the input for `Count` is the `PCollection` of individual 
words generated by the previous `ParDo`, and the output is a `PCollection` of 
key/value pairs where each key represents a unique word in the text and the 
associated value is the occurrence count for each.
 
     ```java
     .apply(Count.<String>perElement())
@@ -221,7 +221,7 @@ The Minimal WordCount pipeline contains five transforms:
 
 4.  The next transform formats each of the key/value pairs of unique words and 
occurrence counts into a printable string suitable for writing to an output 
file.
 
-       The map transform is a higher-level composite transform that 
encapsulates a simple `ParDo`. For each element in the input `PCollection`, the 
map transform applies a function that produces exactly one output element.
+    The map transform is a higher-level composite transform that encapsulates 
a simple `ParDo`. For each element in the input `PCollection`, the map 
transform applies a function that produces exactly one output element.
 
     ```java
     .apply("FormatResults", MapElements.via(new SimpleFunction<KV<String, 
Long>, String>() {
@@ -245,7 +245,7 @@ The Minimal WordCount pipeline contains five transforms:
     ```py
     {% github_sample 
/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py 
tag:examples_wordcount_minimal_write
     %}```
-    
+
 Note that the `Write` transform produces a trivial result value of type 
`PDone`, which in this case is ignored.
 
 ### Running the Pipeline
@@ -260,7 +260,7 @@ p.run().waitUntilFinish();
 {% github_sample 
/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py 
tag:examples_wordcount_minimal_run
 %}```
 
-Note that the `run` method is asynchronous. For a blocking execution instead, 
run your pipeline appending the `waitUntilFinish` method.
+Note that the `run` method is asynchronous. For a blocking execution, run your 
pipeline appending the <span class="language-java">`waitUntilFinish`</span> 
<span class="language-py">`wait_until_finish`</span> method.
 
 ## WordCount Example
 
@@ -558,11 +558,12 @@ public class DebuggingWordCount {
   public static class FilterTextFn extends DoFn<KV<String, Long>, KV<String, 
Long>> {
     ...
 
+    @ProcessElement
     public void processElement(ProcessContext c) {
       if (...) {
         ...
         LOG.debug("Matched: " + c.element().getKey());
-      } else {        
+      } else {
         ...
         LOG.trace("Did not match: " + c.element().getKey());
       }
@@ -578,15 +579,15 @@ public class DebuggingWordCount {
 
 #### Direct Runner
 
-If you execute your pipeline using `DirectRunner`, it will print the log 
messages directly to your local console.
+If you execute your pipeline using `DirectRunner`, it prints the log messages 
directly to your local console.
 
-#### Dataflow Runner
+#### Cloud Dataflow Runner
 
-If you execute your pipeline using `DataflowRunner`, you can use Stackdriver 
Logging. Stackdriver Logging aggregates the logs from all of your Dataflow 
job's workers to a single location in the Google Cloud Platform Console. You 
can use Stackdriver Logging to search and access the logs from all of the 
workers that Dataflow has spun up to complete your Dataflow job. Logging 
statements in your pipeline's `DoFn` instances will appear in Stackdriver 
Logging as your pipeline runs.
+If you execute your pipeline using `DataflowRunner`, you can use Stackdriver 
Logging. Stackdriver Logging aggregates the logs from all of your Cloud 
Dataflow job's workers to a single location in the Google Cloud Platform 
Console. You can use Stackdriver Logging to search and access the logs from all 
of the workers that Cloud Dataflow has spun up to complete your job. Logging 
statements in your pipeline's `DoFn` instances will appear in Stackdriver 
Logging as your pipeline runs.
 
-If you execute your pipeline using `DataflowRunner`, you can control the 
worker log levels. Dataflow workers that execute user code are configured to 
log to Stackdriver Logging by default at "INFO" log level and higher. You can 
override log levels for specific logging namespaces by specifying: 
`--workerLogLevelOverrides={"Name1":"Level1","Name2":"Level2",...}`. For 
example, by specifying 
`--workerLogLevelOverrides={"org.apache.beam.examples":"DEBUG"}` when executing 
this pipeline using t [...]
+If you execute your pipeline using `DataflowRunner`, you can control the 
worker log levels. Cloud Dataflow workers that execute user code are configured 
to log to Stackdriver Logging by default at "INFO" log level and higher. You 
can override log levels for specific logging namespaces by specifying: 
`--workerLogLevelOverrides={"Name1":"Level1","Name2":"Level2",...}`. For 
example, by specifying 
`--workerLogLevelOverrides={"org.apache.beam.examples":"DEBUG"}` when executing 
a pipeline usin [...]
 
-The default Dataflow worker logging configuration can be overridden by 
specifying `--defaultWorkerLogLevel=<one of TRACE, DEBUG, INFO, WARN, ERROR>`. 
For example, by specifying `--defaultWorkerLogLevel=DEBUG` when executing this 
pipeline with the Dataflow service, Cloud Logging would contain all "DEBUG" or 
higher level logs. Note that changing the default worker log level to TRACE or 
DEBUG will significantly increase the amount of logs output.
+The default Cloud Dataflow worker logging configuration can be overridden by 
specifying `--defaultWorkerLogLevel=<one of TRACE, DEBUG, INFO, WARN, ERROR>`. 
For example, by specifying `--defaultWorkerLogLevel=DEBUG` when executing a 
pipeline with the Cloud Dataflow service, Cloud Logging will contain all 
"DEBUG" or higher level logs. Note that changing the default worker log level 
to TRACE or DEBUG significantly increases the amount of logs output.
 
 #### Apache Spark Runner
 
@@ -604,7 +605,7 @@ The default Dataflow worker logging configuration can be 
overridden by specifyin
 
 `PAssert` is a set of convenient PTransforms in the style of Hamcrest's 
collection matchers that can be used when writing Pipeline level tests to 
validate the contents of PCollections. `PAssert` is best used in unit tests 
with small data sets, but is demonstrated here as a teaching tool.
 
-Below, we verify that the set of filtered words matches our expected counts. 
Note that `PAssert` does not produce any output, and pipeline will only succeed 
if all of the expectations are met. See 
[DebuggingWordCountTest](https://github.com/apache/beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/DebuggingWordCountTest.java)
 for an example unit test.
+Below, we verify that the set of filtered words matches our expected counts. 
Note that `PAssert` does not produce any output, and the pipeline only succeeds 
if all of the expectations are met. See 
[DebuggingWordCountTest](https://github.com/apache/beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/DebuggingWordCountTest.java)
 for an example unit test.
 
 ```java
 public static void main(String[] args) {
@@ -618,7 +619,7 @@ public static void main(String[] args) {
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 ## WindowedWordCount
@@ -685,7 +686,7 @@ To view the full code in Java, see 
**[WindowedWordCount](https://github.com/apac
 
 Beam allows you to create a single pipeline that can handle both bounded and 
unbounded types of input. If the input is unbounded, then all PCollections of 
the pipeline will be unbounded as well. The same goes for bounded input. If 
your input has a fixed number of elements, it's considered a 'bounded' data 
set. If your input is continuously updating, then it's considered 'unbounded'.
 
-Recall that the input for this example is a set of Shakespeare's texts, finite 
data. Therefore, this example reads bounded data from a text file:
+Recall that the input for this example is a set of Shakespeare's texts, which 
is finite data. Therefore, this example reads bounded data from a text file:
 
 ```java
 public static void main(String[] args) throws IOException {
@@ -698,19 +699,19 @@ public static void main(String[] args) throws IOException 
{
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 ### Adding Timestamps to Data
 
 Each element in a `PCollection` has an associated **timestamp**. The timestamp 
for each element is initially assigned by the source that creates the 
`PCollection` and can be adjusted by a `DoFn`. In this example the input is 
bounded. For the purpose of the example, the `DoFn` method named 
`AddTimestampsFn` (invoked by `ParDo`) will set a timestamp for each element in 
the `PCollection`.
-  
+
 ```java
-.apply(ParDo.of(new AddTimestampFn()));
+.apply(ParDo.of(new AddTimestampFn(minTimestamp, maxTimestamp)));
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 Below is the code for `AddTimestampFn`, a `DoFn` invoked by `ParDo`, that sets 
the data element of the timestamp given the element itself. For example, if the 
elements were log lines, this `ParDo` could parse the time out of the log 
string and set it as the element's timestamp. There are no timestamps inherent 
in the works of Shakespeare, so in this case we've made up random timestamps 
just to illustrate the concept. Each line of the input text will get a random 
associated timestamp some [...]
@@ -741,14 +742,14 @@ static class AddTimestampFn extends DoFn<String, String> {
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 ### Windowing
 
-Beam uses a concept called **Windowing** to subdivide a `PCollection` 
according to the timestamps of its individual elements. PTransforms that 
aggregate multiple elements, process each `PCollection` as a succession of 
multiple, finite windows, even though the entire collection itself may be of 
infinite size (unbounded).
+Beam uses a concept called **Windowing** to subdivide a `PCollection` 
according to the timestamps of its individual elements. PTransforms that 
aggregate multiple elements process each `PCollection` as a succession of 
multiple, finite windows, even though the entire collection itself may be of 
infinite size (unbounded).
 
-The `WindowedWordCount` example applies fixed-time windowing, wherein each 
window represents a fixed time interval. The fixed window size for this example 
defaults to 1 minute (you can change this with a command-line option). 
+The `WindowedWordCount` example applies fixed-time windowing, wherein each 
window represents a fixed time interval. The fixed window size for this example 
defaults to 1 minute (you can change this with a command-line option).
 
 ```java
 PCollection<String> windowedWords = input
@@ -757,7 +758,7 @@ PCollection<String> windowedWords = input
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 ### Reusing PTransforms over windowed PCollections
@@ -769,14 +770,14 @@ PCollection<KV<String, Long>> wordCounts = 
windowedWords.apply(new WordCount.Cou
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 
 ### Write Results to an Unbounded Sink
 
 When our input is unbounded, the same is true of our output `PCollection`. We 
need to make sure that we choose an appropriate, unbounded sink. Some output 
sinks support only bounded output, while others support both bounded and 
unbounded outputs. By using a `FilenamePolicy`, we can use `TextIO` to files 
that are partitioned by windows. We use a composite `PTransform` that uses such 
a policy internally to write a single sharded file per window.
 
-In this example, we stream the results to a BigQuery table. The results are 
then formatted for a BigQuery table, and then written to BigQuery using 
BigQueryIO.Write.
+In this example, we stream the results to a BigQuery table. The results are 
then formatted for a Google BigQuery table, and then written to BigQuery using 
`BigQueryIO.Write`.
 
 ```java
   wordCounts
@@ -785,6 +786,6 @@ In this example, we stream the results to a BigQuery table. 
The results are then
 ```
 
 ```py
-This feature is not yet available in the Beam SDK for Python.
+# This feature is not yet available in the Beam SDK for Python.
 ```
 

-- 
To stop receiving notification emails like this one, please contact
"[email protected]" <[email protected]>.

[beam-site] 01/04: [BEAM-667] Verify and update wordcount snippets

Reply via email to