Repository: beam
Updated Branches:
  refs/heads/master ddc75958c -> 9ca6511a7
[BEAM-1771] Clean up dataflow/google references/URLs in examples

Project: http://git-wip-us.apache.org/repos/asf/beam/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/81bcbb4d
Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/81bcbb4d
Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/81bcbb4d

Branch: refs/heads/master
Commit: 81bcbb4d19843a9f404ed9c8bffb64f66e41fb6d
Parents: 4ffd43e
Author: melissa <[email protected]>
Authored: Mon Mar 20 19:35:22 2017 -0700
Committer: melissa <[email protected]>
Committed: Mon Mar 20 19:49:45 2017 -0700

----------------------------------------------------------------------
 examples/java/README.md                          | 61 +++++++++++---------
 .../beam/examples/DebuggingWordCount.java        |  2 +-
 .../org/apache/beam/examples/cookbook/README.md  |  2 +-
 .../beam/examples/complete/game/README.md        |  6 +-
 .../complete/game/injector/Injector.java         |  3 +-
 .../examples/complete/game/GameStatsTest.java    |  2 +-
 .../complete/game/HourlyTeamScoreTest.java       |  2 +-
 7 files changed, 42 insertions(+), 36 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java/README.md
----------------------------------------------------------------------
diff --git a/examples/java/README.md b/examples/java/README.md
index b58153e..d891fb8 100644
--- a/examples/java/README.md
+++ b/examples/java/README.md
@@ -20,25 +20,23 @@
 # Example Pipelines
 
 The examples included in this module serve to demonstrate the basic
-functionality of Google Cloud Dataflow, and act as starting points for
+functionality of Apache Beam, and act as starting points for
 the development of more complex pipelines.
 
 ## Word Count
 
 A good starting point for new users is our set of
-[word count](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples) examples, which computes word frequencies. This series of four successively more detailed pipelines is described in detail in the accompanying [walkthrough](https://cloud.google.com/dataflow/examples/wordcount-example).
+[word count](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples) examples, which computes word frequencies. This series of four successively more detailed pipelines is described in detail in the accompanying [walkthrough](https://beam.apache.org/get-started/wordcount-example/).
 
-1. [`MinimalWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java) is the simplest word count pipeline and introduces basic concepts like [Pipelines](https://cloud.google.com/dataflow/model/pipelines),
-[PCollections](https://cloud.google.com/dataflow/model/pcollection),
-[ParDo](https://cloud.google.com/dataflow/model/par-do),
-and [reading and writing data](https://cloud.google.com/dataflow/model/reading-and-writing-data) from external storage.
+1. [`MinimalWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/MinimalWordCount.java) is the simplest word count pipeline and introduces basic concepts like [Pipelines](https://beam.apache.org/documentation/programming-guide/#pipeline),
+[PCollections](https://beam.apache.org/documentation/programming-guide/#pcollection),
+[ParDo](https://beam.apache.org/documentation/programming-guide/#transforms-pardo),
+and [reading and writing data](https://beam.apache.org/documentation/programming-guide/#io) from external storage.
 
-1. [`WordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java) introduces Dataflow best practices like [PipelineOptions](https://cloud.google.com/dataflow/pipelines/constructing-your-pipeline#Creating) and custom [PTransforms](https://cloud.google.com/dataflow/model/composite-transforms).
+1. [`WordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WordCount.java) introduces best practices like [PipelineOptions](https://beam.apache.org/documentation/programming-guide/#pipeline) and custom [PTransforms](https://beam.apache.org/documentation/programming-guide/#transforms-composite).
 
 1. [`DebuggingWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java)
-shows how to view live aggregators in the [Dataflow Monitoring Interface](https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf), get the most out of
-[Cloud Logging](https://cloud.google.com/dataflow/pipelines/logging) integration, and start writing
-[good tests](https://cloud.google.com/dataflow/pipelines/testing-your-pipeline).
+demonstrates some best practices for instrumenting your pipeline code.
 
 1. [`WindowedWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java) shows how to run the same pipeline over either unbounded PCollections in streaming mode or bounded PCollections in batch mode.
 
@@ -50,46 +48,55 @@ Change directory into `examples/java` and run the examples:
     -Dexec.mainClass=<MAIN CLASS> \
     -Dexec.args="<EXAMPLE-SPECIFIC ARGUMENTS>"
 
-For example, you can execute the `WordCount` pipeline on your local machine as follows:
+Alternatively, you may choose to bundle all dependencies into a single JAR and
+execute it outside of the Maven environment.
+
+### Direct Runner
+
+You can execute the `WordCount` pipeline on your local machine as follows:
 
     mvn compile exec:java \
     -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--inputFile=<LOCAL INPUT FILE> --output=<LOCAL OUTPUT FILE>"
 
-Once you have followed the general Cloud Dataflow
-[Getting Started](https://cloud.google.com/dataflow/getting-started) instructions, you can execute
-the same pipeline on fully managed resources in Google Cloud Platform:
+To create the bundled JAR of the examples and execute it locally:
+
+    mvn package
+
+    java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
+    org.apache.beam.examples.WordCount \
+    --inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>
+
+### Google Cloud Dataflow Runner
+
+After you have followed the general Cloud Dataflow
+[prerequisites and setup](https://beam.apache.org/documentation/runners/dataflow/), you can execute
+the pipeline on fully managed resources in Google Cloud Platform:
 
     mvn compile exec:java \
     -Dexec.mainClass=org.apache.beam.examples.WordCount \
     -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
    --tempLocation=<YOUR CLOUD STORAGE LOCATION> \
-    --runner=BlockingDataflowRunner"
+    --runner=DataflowRunner"
 
 Make sure to use your project id, not the project number or the descriptive name.
-The Cloud Storage location should be entered in the form of
+The Google Cloud Storage location should be entered in the form of
 `gs://bucket/path/to/staging/directory`.
 
-Alternatively, you may choose to bundle all dependencies into a single JAR and
-execute it outside of the Maven environment. For example, you can execute the
-following commands to create the
-bundled JAR of the examples and execute it both locally and in Cloud
-Platform:
+To create the bundled JAR of the examples and execute it in Google Cloud Platform:
 
     mvn package
 
    java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
    org.apache.beam.examples.WordCount \
-    --inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE>
-
-    java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \
-    org.apache.beam.examples.WordCount \
    --project=<YOUR CLOUD PLATFORM PROJECT ID> \
    --tempLocation=<YOUR CLOUD STORAGE LOCATION> \
-    --runner=BlockingDataflowRunner
+    --runner=DataflowRunner
+
+## Other Examples
 
 Other examples can be run similarly by replacing the `WordCount` class path
 with the example classpath, e.g.
-`org.apache.beam.examples.cookbook.BigQueryTornadoes`,
+`org.apache.beam.examples.cookbook.CombinePerKeyExamples`,
 and adjusting runtime options under the `Dexec.args` parameter, as specified in
 the example itself.


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java
----------------------------------------------------------------------
diff --git a/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java b/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java
index f7493f3..031f317 100644
--- a/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java
+++ b/examples/java/src/main/java/org/apache/beam/examples/DebuggingWordCount.java
@@ -151,7 +151,7 @@ public class DebuggingWordCount {
      * <p>Below we verify that the set of filtered words matches our expected counts. Note
      * that PAssert does not provide any output and that successful completion of the
      * Pipeline implies that the expectations were met. Learn more at
-     * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline on how to test
+     * https://beam.apache.org/documentation/pipelines/test-your-pipeline/ on how to test
      * your Pipeline and see {@link DebuggingWordCountTest} for an example unit test.
      */
     List<KV<String, Long>> expectedResults = Arrays.asList(


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md
----------------------------------------------------------------------
diff --git a/examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md b/examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md
index dab4e73..b167cd7 100644
--- a/examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md
+++ b/examples/java/src/main/java/org/apache/beam/examples/cookbook/README.md
@@ -21,7 +21,7 @@
 
 This directory holds simple "cookbook" examples, which show how to define
 commonly-used data analysis patterns that you would likely incorporate into a
-larger Dataflow pipeline. They include:
+larger Apache Beam pipeline. They include:
 
 <ul>
   <li><a href="https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/cookbook/BigQueryTornadoes.java">BigQueryTornadoes</a>


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md
----------------------------------------------------------------------
diff --git a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md b/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md
index 25e31f5..fdce05c 100644
--- a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md
+++ b/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md
@@ -20,10 +20,10 @@
 
 # 'Gaming' examples
 
-This directory holds a series of example Dataflow pipelines in a simple 'mobile
+This directory holds a series of example Apache Beam pipelines in a simple 'mobile
 gaming' domain. They all require Java 8. Each pipeline successively introduces
 new concepts, and gives some examples of using Java 8 syntax in constructing
-Dataflow pipelines. Other than usage of Java 8 lambda expressions, the concepts
+Beam pipelines. Other than usage of Java 8 lambda expressions, the concepts
 that are used apply equally well in Java 7.
 
 In the gaming scenario, many users play, as members of different teams, over
@@ -58,7 +58,7 @@ the day's cutoff point.
 
 The next pipeline in the series is `HourlyTeamScore`. This pipeline also
 processes data collected from gaming events in batch. It builds on `UserScore`,
-but uses [fixed windows](https://cloud.google.com/dataflow/model/windowing), by
+but uses [fixed windows](https://beam.apache.org/documentation/programming-guide/#windowing), by
 default an hour in duration. It calculates the sum of scores per team, for each
 window, optionally allowing specification of two timestamps before and after
 which data is filtered out. This allows a model where late data collected after


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java8/src/main/java/org/apache/beam/examples/complete/game/injector/Injector.java
----------------------------------------------------------------------
diff --git a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/injector/Injector.java b/examples/java8/src/main/java/org/apache/beam/examples/complete/game/injector/Injector.java
index 8c23cd7..b9a3ff2 100644
--- a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/injector/Injector.java
+++ b/examples/java8/src/main/java/org/apache/beam/examples/complete/game/injector/Injector.java
@@ -260,8 +260,7 @@ class Injector {
       user = team.getRandomUser();
     }
     String event = user + "," + teamName + "," + random.nextInt(MAX_SCORE);
-    // Randomly introduce occasional parse errors. You can see a custom counter tracking the number
-    // of such errors in the Dataflow Monitoring UI, as the example pipeline runs.
+    // Randomly introduce occasional parse errors.
     if (random.nextInt(parseErrorRate) == 0) {
       System.out.println("Introducing a parse error.");
       event = "THIS LINE REPRESENTS CORRUPT DATA AND WILL CAUSE A PARSE ERROR";


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java8/src/test/java/org/apache/beam/examples/complete/game/GameStatsTest.java
----------------------------------------------------------------------
diff --git a/examples/java8/src/test/java/org/apache/beam/examples/complete/game/GameStatsTest.java b/examples/java8/src/test/java/org/apache/beam/examples/complete/game/GameStatsTest.java
index da2bb91..36cf9bc 100644
--- a/examples/java8/src/test/java/org/apache/beam/examples/complete/game/GameStatsTest.java
+++ b/examples/java8/src/test/java/org/apache/beam/examples/complete/game/GameStatsTest.java
@@ -38,7 +38,7 @@ import org.junit.runners.JUnit4;
  * Tests of GameStats.
  * Because the pipeline was designed for easy readability and explanations, it lacks good
  * modularity for testing. See our testing documentation for better ideas:
- * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline.
+ * https://beam.apache.org/documentation/pipelines/test-your-pipeline/
  */
 @RunWith(JUnit4.class)
 public class GameStatsTest implements Serializable {


http://git-wip-us.apache.org/repos/asf/beam/blob/81bcbb4d/examples/java8/src/test/java/org/apache/beam/examples/complete/game/HourlyTeamScoreTest.java
----------------------------------------------------------------------
diff --git a/examples/java8/src/test/java/org/apache/beam/examples/complete/game/HourlyTeamScoreTest.java b/examples/java8/src/test/java/org/apache/beam/examples/complete/game/HourlyTeamScoreTest.java
index 34a0744..5fc94a5 100644
--- a/examples/java8/src/test/java/org/apache/beam/examples/complete/game/HourlyTeamScoreTest.java
+++ b/examples/java8/src/test/java/org/apache/beam/examples/complete/game/HourlyTeamScoreTest.java
@@ -45,7 +45,7 @@ import org.junit.runners.JUnit4;
  * Tests of HourlyTeamScore.
 * Because the pipeline was designed for easy readability and explanations, it lacks good
 * modularity for testing. See our testing documentation for better ideas:
- * https://cloud.google.com/dataflow/pipelines/testing-your-pipeline.
+ * https://beam.apache.org/documentation/pipelines/test-your-pipeline/
 */
 @RunWith(JUnit4.class)
 public class HourlyTeamScoreTest implements Serializable {
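
For readers following the `WordCount` pointers in the README above: the example wires its
`--inputFile`/`--output` flags through a custom `PipelineOptions` interface. Below is a minimal
sketch of that pattern, assuming the Beam Java SDK's `PipelineOptionsFactory`; the interface name
`WordCountLikeOptions` and the default input path are illustrative only, not taken from this
commit:

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public interface WordCountLikeOptions extends PipelineOptions {
      @Description("Path of the file to read from")
      @Default.String("gs://apache-beam-samples/shakespeare/kinglear.txt")  // illustrative default
      String getInputFile();
      void setInputFile(String value);

      @Description("Path of the file to write word counts to")
      String getOutput();
      void setOutput(String value);
    }

    // Typical use in main(): parse and validate the flags passed via -Dexec.args.
    // WordCountLikeOptions options =
    //     PipelineOptionsFactory.fromArgs(args).withValidation().as(WordCountLikeOptions.class);

The `fromArgs(...).withValidation()` call is what turns the `-Dexec.args` strings shown in the
README into typed options available inside the pipeline.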
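The test-your-pipeline links above and the `DebuggingWordCount` javadoc both lean on `PAssert`
rather than inspecting output files. A minimal sketch of that style of unit test, assuming the
Beam Java SDK test utilities (`TestPipeline`, `PAssert`); the test class name and the
`Create.of(...)` stand-in for the transform under test are illustrative only:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.testing.PAssert;
    import org.apache.beam.sdk.testing.TestPipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.junit.Rule;
    import org.junit.Test;

    public class FilteredCountsTest {
      @Rule public final transient TestPipeline p = TestPipeline.create();

      @Test
      public void testFilteredCounts() {
        List<KV<String, Long>> expected =
            Arrays.asList(KV.of("Flourish", 3L), KV.of("stomach", 1L));

        // Stand-in for the transform under test; a real test would apply the
        // pipeline's own filtering/counting transforms to fixed input data.
        PCollection<KV<String, Long>> filtered = p.apply(Create.of(expected));

        // PAssert produces no output; a failed expectation fails the run() below.
        PAssert.that(filtered).containsInAnyOrder(expected);
        p.run().waitUntilFinish();
      }
    }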
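`HourlyTeamScore`, referenced in the gaming README above, gets its hourly totals by applying
fixed windows before a per-key sum. A minimal sketch of that windowing step, assuming an input
`PCollection` of (team, score) pairs that already carries event timestamps; the class and method
names are illustrative:

    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class HourlyTeamTotals {
      /** Sums scores per team over one-hour fixed windows (the HourlyTeamScore default). */
      public static PCollection<KV<String, Integer>> compute(
          PCollection<KV<String, Integer>> teamScores) {
        return teamScores
            // Assign each element to the hour-long window containing its event timestamp.
            .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1))))
            // Sum the scores for each team within each window.
            .apply(Sum.integersPerKey());
      }
    }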
