Repository: beam Updated Branches: refs/heads/master 9575694ca -> 5e4fd1b95
Remove sdk & example README.md that are covered in web site Project: http://git-wip-us.apache.org/repos/asf/beam/repo Commit: http://git-wip-us.apache.org/repos/asf/beam/commit/672a2c40 Tree: http://git-wip-us.apache.org/repos/asf/beam/tree/672a2c40 Diff: http://git-wip-us.apache.org/repos/asf/beam/diff/672a2c40 Branch: refs/heads/master Commit: 672a2c400badd8cd33edb9585d2325b9f265c963 Parents: 9575694 Author: Ahmet Altay <[email protected]> Authored: Mon May 8 15:26:50 2017 -0700 Committer: Ahmet Altay <[email protected]> Committed: Tue May 9 09:59:15 2017 -0700 ---------------------------------------------------------------------- .gitignore | 1 + examples/java/README.md | 64 +--- .../beam/examples/complete/game/README.md | 131 -------- pom.xml | 2 + sdks/python/README.md | 298 ------------------- .../examples/complete/game/README.md | 69 ----- 6 files changed, 5 insertions(+), 560 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/.gitignore ---------------------------------------------------------------------- diff --git a/.gitignore b/.gitignore index 9cfae09..1ecb993 100644 --- a/.gitignore +++ b/.gitignore @@ -24,6 +24,7 @@ sdks/python/**/*.so sdks/python/**/*.egg sdks/python/LICENSE sdks/python/NOTICE +sdks/python/README.md # Ignore IntelliJ files. .idea/ http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/examples/java/README.md ---------------------------------------------------------------------- diff --git a/examples/java/README.md b/examples/java/README.md index d891fb8..75b70dd 100644 --- a/examples/java/README.md +++ b/examples/java/README.md @@ -40,69 +40,9 @@ demonstrates some best practices for instrumenting your pipeline code. 1. [`WindowedWordCount`](https://github.com/apache/beam/blob/master/examples/java/src/main/java/org/apache/beam/examples/WindowedWordCount.java) shows how to run the same pipeline over either unbounded PCollections in streaming mode or bounded PCollections in batch mode. -## Building and Running +## Running Examples -Change directory into `examples/java` and run the examples: - - mvn compile exec:java \ - -Dexec.mainClass=<MAIN CLASS> \ - -Dexec.args="<EXAMPLE-SPECIFIC ARGUMENTS>" - -Alternatively, you may choose to bundle all dependencies into a single JAR and -execute it outside of the Maven environment. - -### Direct Runner - -You can execute the `WordCount` pipeline on your local machine as follows: - - mvn compile exec:java \ - -Dexec.mainClass=org.apache.beam.examples.WordCount \ - -Dexec.args="--inputFile=<LOCAL INPUT FILE> --output=<LOCAL OUTPUT FILE>" - -To create the bundled JAR of the examples and execute it locally: - - mvn package - - java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \ - org.apache.beam.examples.WordCount \ - --inputFile=<INPUT FILE PATTERN> --output=<OUTPUT FILE> - -### Google Cloud Dataflow Runner - -After you have followed the general Cloud Dataflow -[prerequisites and setup](https://beam.apache.org/documentation/runners/dataflow/), you can execute -the pipeline on fully managed resources in Google Cloud Platform: - - mvn compile exec:java \ - -Dexec.mainClass=org.apache.beam.examples.WordCount \ - -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \ - --tempLocation=<YOUR CLOUD STORAGE LOCATION> \ - --runner=DataflowRunner" - -Make sure to use your project id, not the project number or the descriptive name. 
-The Google Cloud Storage location should be entered in the form of -`gs://bucket/path/to/staging/directory`. - -To create the bundled JAR of the examples and execute it in Google Cloud Platform: - - mvn package - - java -cp examples/java/target/beam-examples-java-bundled-<VERSION>.jar \ - org.apache.beam.examples.WordCount \ - --project=<YOUR CLOUD PLATFORM PROJECT ID> \ - --tempLocation=<YOUR CLOUD STORAGE LOCATION> \ - --runner=DataflowRunner - -## Other Examples - -Other examples can be run similarly by replacing the `WordCount` class path with the example class path, e.g. -`org.apache.beam.examples.cookbook.CombinePerKeyExamples`, -and adjusting runtime options under the `Dexec.args` parameter, as specified in -the example itself. - -Note that when running Maven on the Microsoft Windows platform, backslashes (`\`) -under the `Dexec.args` parameter should be escaped with another backslash. For -example, an input file pattern of `c:\*.txt` should be entered as `c:\\*.txt`. +See [Apache Beam WordCount Example](https://beam.apache.org/get-started/wordcount-example/) for information on running these examples. ## Beyond Word Count http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md ---------------------------------------------------------------------- diff --git a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md b/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md deleted file mode 100644 index fdce05c..0000000 --- a/examples/java8/src/main/java/org/apache/beam/examples/complete/game/README.md +++ /dev/null @@ -1,131 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -# 'Gaming' examples - - -This directory holds a series of example Apache Beam pipelines in a simple 'mobile -gaming' domain. They all require Java 8. Each pipeline successively introduces -new concepts, and gives some examples of using Java 8 syntax in constructing -Beam pipelines. Other than usage of Java 8 lambda expressions, the concepts -that are used apply equally well in Java 7. - -In the gaming scenario, many users play, as members of different teams, over -the course of a day, and their actions are logged for processing. Some of the -logged game events may be late-arriving, if users play on mobile devices and go -transiently offline for a period. - -The scenario includes not only "regular" users, but "robot users", which have a -higher click rate than the regular users, and may move from team to team. - -The first two pipelines in the series use pre-generated batch data samples. The -second two pipelines read from a [PubSub](https://cloud.google.com/pubsub/) -topic input. 
For these examples, you will also need to run the -`injector.Injector` program, which generates and publishes the gaming data to -PubSub. The javadocs for each pipeline have more detailed information on how to -run that pipeline. - -All of these pipelines write their results to BigQuery table(s). - - -## The pipelines in the 'gaming' series - -### UserScore - -The first pipeline in the series is `UserScore`. This pipeline does batch -processing of data collected from gaming events. It calculates the sum of -scores per user, over an entire batch of gaming data (collected, say, for each -day). The batch processing will not include any late data that arrives after -the day's cutoff point. - -### HourlyTeamScore - -The next pipeline in the series is `HourlyTeamScore`. This pipeline also -processes data collected from gaming events in batch. It builds on `UserScore`, -but uses [fixed windows](https://beam.apache.org/documentation/programming-guide/#windowing), by -default an hour in duration. It calculates the sum of scores per team, for each -window, optionally allowing specification of two timestamps before and after -which data is filtered out. This allows a model where late data collected after -the intended analysis window can be included in the analysis, and any late- -arriving data prior to the beginning of the analysis window can be removed as -well. - -By using windowing and adding element timestamps, we can do finer-grained -analysis than with the `UserScore` pipeline; we're now tracking scores for -each hour rather than over the course of a whole day. However, our batch -processing is high-latency, in that we don't get results from plays at the -beginning of the batch's time period until the complete batch is processed. - -### LeaderBoard - -The third pipeline in the series is `LeaderBoard`. This pipeline processes an -unbounded stream of 'game events' from a PubSub topic. The calculation of the -team scores uses fixed windowing based on event time (the time of the game play -event), not processing time (the time that an event is processed by the -pipeline). The pipeline calculates the sum of scores per team, for each window. -By default, the team scores are calculated using one-hour windows. - -In contrast, to demo another windowing option, the user scores are calculated -using a global window, which periodically (every ten minutes) emits cumulative -user score sums. - -In contrast to the previous pipelines in the series, which used static, finite -input data, here we're using an unbounded data source, which lets us provide -_speculative_ results, and allows handling of late data, at much lower latency. -E.g., we could use the early/speculative results to keep a 'leaderboard' -updated in near real-time. Our handling of late data lets us generate correct -results, e.g. for 'team prizes'. We're now outputting window results as they're -calculated, giving us much lower latency than with the previous batch examples. - -### GameStats - -The fourth pipeline in the series is `GameStats`. This pipeline builds -on the `LeaderBoard` functionality (supporting output of speculative and late -data) and adds some "business intelligence" analysis: abuse -detection. The pipeline derives the mean user score sum for a window, and uses -that information to identify likely spammers/robots. (The injector is designed -so that the "robots" have a higher click rate than the "real" users). The robot -users are then filtered out when calculating the team scores.
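The spam-filtering step just described is easier to picture in code. Below is a minimal sketch in Beam's Python SDK; the actual `GameStats` pipeline is written in Java, and the `(user, score)` element shape, the function name, and the `threshold` value are illustrative assumptions for this sketch, not the example's real API:

```python
import apache_beam as beam

def filter_robots(user_scores, threshold=2.5):
    """Sketch of GameStats-style abuse detection on (user, score) pairs."""
    # Sum scores per user within each (upstream-applied) window.
    sums = user_scores | 'sum per user' >> beam.CombinePerKey(sum)
    # Mean of the per-user sums in the same window; without_defaults() is
    # required for a global combine over non-global windows.
    mean = (sums
            | 'drop keys' >> beam.Values()
            | 'mean' >> beam.CombineGlobally(
                beam.combiners.MeanCombineFn()).without_defaults())
    # Keep only users whose sum stays below threshold * mean. The mean is
    # passed as a singleton side input, matched to the main input window
    # by window.
    return sums | 'drop robots' >> beam.Filter(
        lambda kv, m: kv[1] < threshold * m,
        m=beam.pvalue.AsSingleton(mean))
```

Side inputs are windowed the same way as the main input, which is what lets the per-window mean line up with the per-window user sums here.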
- -Additionally, user sessions are tracked: that is, we find bursts of user -activity using session windows. Then, the mean session duration information is -recorded in the context of subsequent fixed windowing. (This could be used to -tell us what games are giving us greater user retention). - -### Running the PubSub Injector - -The `LeaderBoard` and `GameStats` example pipelines read unbounded data -from a PubSub topic. - -Use the `injector.Injector` program to generate this data and publish to a -PubSub topic. See the `Injector` javadocs for more information on how to run the -injector. Set up the injector before you start one of these pipelines. Then, -when you start the pipeline, pass as an argument the name of that PubSub topic. -See the pipeline javadocs for the details. - -## Viewing the results in BigQuery - -All of the pipelines write their results to BigQuery. `UserScore` and -`HourlyTeamScore` each write one table, and `LeaderBoard` and -`GameStats` each write two. The pipelines have default table names that -you can override when you start up the pipeline if those tables already exist. - -Depending on the windowing intervals defined in a given pipeline, you may have -to wait for a while (more than an hour) before you start to see results written -to the BigQuery tables. http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/pom.xml ---------------------------------------------------------------------- diff --git a/pom.xml b/pom.xml index e3fa7d8..9ca876e 100644 --- a/pom.xml +++ b/pom.xml @@ -1543,6 +1543,7 @@ <include>**/*.egg-info/</include> <include>**/sdks/python/LICENSE</include> <include>**/sdks/python/NOTICE</include> + <include>**/sdks/python/README.md</include> </includes> <followSymlinks>false</followSymlinks> </fileset> @@ -1680,6 +1681,7 @@ <includes> <include>LICENSE</include> <include>NOTICE</include> + <include>README.md</include> </includes> </resource> </resources> http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/sdks/python/README.md ---------------------------------------------------------------------- diff --git a/sdks/python/README.md b/sdks/python/README.md deleted file mode 100644 index 7836a04..0000000 --- a/sdks/python/README.md +++ /dev/null @@ -1,298 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> -# Apache Beam - Python SDK - -[Apache Beam](http://beam.apache.org) is a unified model for defining both batch and streaming data-parallel processing pipelines. Beam provides a set of language-specific SDKs for constructing pipelines. These pipelines can be executed on distributed processing backends like [Apache Spark](http://spark.apache.org/), [Apache Flink](http://flink.apache.org), and [Google Cloud Dataflow](http://cloud.google.com/dataflow). 
- -Apache Beam for Python provides access to Beam capabilities from the Python programming language. - -## Table of Contents - * [Overview of the Beam Programming Model](#overview-of-the-programming-model) - * [Getting Started](#getting-started) - * [A Quick Tour of the Source Code](#a-quick-tour-of-the-source-code) - * [Simple Examples](#simple-examples) - * [Basic pipeline](#basic-pipeline) - * [Basic pipeline (with Map)](#basic-pipeline-with-map) - * [Basic pipeline (with FlatMap)](#basic-pipeline-with-flatmap) - * [Basic pipeline (with FlatMap and yield)](#basic-pipeline-with-flatmap-and-yield) - * [Counting words](#counting-words) - * [Counting words with GroupByKey](#counting-words-with-groupbykey) - * [Type hints](#type-hints) - * [BigQuery](#bigquery) - * [Combiner examples](#combiner-examples) - * [Organizing Your Code](#organizing-your-code) - * [More Information](#more-information) - -## Overview of the Programming Model - -The key concepts of the programming model are: - -* PCollection - represents a collection of data, which could be bounded or unbounded in size. -* PTransform - represents a computation that transforms input PCollections into output -PCollections. -* Pipeline - manages a directed acyclic graph of PTransforms and PCollections that is ready -for execution. -* Runner - specifies where and how the Pipeline should execute. - -For a more detailed introduction, please read the -[Beam Programming Model](http://beam.apache.org/documentation/programming-guide). - -## Getting Started - -See [Apache Beam Python SDK Quickstart](https://beam.apache.org/get-started/quickstart-py/). - -## A Quick Tour of the Source Code - -With your virtual environment active, you can follow along with this tour by running a `pydoc` server on a local port of your choosing (this example uses port 8888): - -``` -pydoc -p 8888 -``` - -Open your browser and go to -http://localhost:8888/apache_beam.html - -Some interesting classes to navigate to: - -* `PCollection`, in file -[`apache_beam/pvalue.py`](http://localhost:8888/apache_beam.pvalue.html) -* `PTransform`, in file -[`apache_beam/transforms/ptransform.py`](http://localhost:8888/apache_beam.transforms.ptransform.html) -* `FlatMap`, `GroupByKey`, and `Map`, in file -[`apache_beam/transforms/core.py`](http://localhost:8888/apache_beam.transforms.core.html) -* combiners, in file -[`apache_beam/transforms/combiners.py`](http://localhost:8888/apache_beam.transforms.combiners.html) - -Make sure you have installed the package first. If not, run `python setup.py install`, then run pydoc with `pydoc -p 8888`. - -## Simple Examples - -The following examples demonstrate some basic, fundamental concepts for using Apache Beam's Python SDK. For more detailed examples, Beam provides a [directory of examples](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples) for Python. - -### Basic pipeline - -A basic pipeline will take as input an iterable, apply the -`beam.Create` `PTransform`, and produce a `PCollection` that can -be written to a file or modified by further `PTransform`s. -The `>>` operator is used to label `PTransform`s and -the `|` operator is used to chain them. - -```python -# Standard imports -import apache_beam as beam -# Create a pipeline executing on a direct runner (local, non-cloud). -p = beam.Pipeline('DirectRunner') -# Create a PCollection with names and write it to a file. -(p - | 'add names' >> beam.Create(['Ann', 'Joe']) - | 'save' >> beam.io.WriteToText('./names')) -# Execute the pipeline. 
-p.run() -``` - -### Basic pipeline (with Map) - -The `Map` `PTransform` returns one output per input. It takes a callable that is applied to each element of the input `PCollection` and returns an element for the output `PCollection`. - -```python -import apache_beam as beam -p = beam.Pipeline('DirectRunner') -# Read a file containing names, add a greeting to each name, and write to a file. -(p - | 'load names' >> beam.io.ReadFromText('./names') - | 'add greeting' >> beam.Map(lambda name, msg: '%s, %s!' % (msg, name), 'Hello') - | 'save' >> beam.io.WriteToText('./greetings')) -p.run() -``` - -### Basic pipeline (with FlatMap) - -A `FlatMap` is like a `Map`, except that it returns zero or more outputs per input. Its callable is applied to each element of the input `PCollection` and returns a (possibly empty) iterable of elements for the output `PCollection`. - -```python -import apache_beam as beam -p = beam.Pipeline('DirectRunner') -# Read a file containing names, add two greetings to each name, and write to a file. -(p - | 'load names' >> beam.io.ReadFromText('./names') - | 'add greetings' >> beam.FlatMap( - lambda name, messages: ['%s %s!' % (msg, name) for msg in messages], - ['Hello', 'Hola']) - | 'save' >> beam.io.WriteToText('./greetings')) -p.run() -``` - -### Basic pipeline (with FlatMap and yield) - -The callable of a `FlatMap` can be a generator, that is, -a function using `yield`. - -```python -import apache_beam as beam -p = beam.Pipeline('DirectRunner') -# Read a file containing names, add two greetings to each name -# (with FlatMap using a yield generator), and write to a file. -def add_greetings(name, messages): - for msg in messages: - yield '%s %s!' % (msg, name) - -(p - | 'load names' >> beam.io.ReadFromText('./names') - | 'add greetings' >> beam.FlatMap(add_greetings, ['Hello', 'Hola']) - | 'save' >> beam.io.WriteToText('./greetings')) -p.run() -``` - -### Counting words - -This example shows how to read a text file from [Google Cloud Storage](https://cloud.google.com/storage/) and count its words. - -```python -import re -import apache_beam as beam -p = beam.Pipeline('DirectRunner') -(p - | 'read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt') - | 'split' >> beam.FlatMap(lambda x: re.findall(r'\w+', x)) - | 'count words' >> beam.combiners.Count.PerElement() - | 'save' >> beam.io.WriteToText('./word_count')) -p.run() -``` - -### Counting words with GroupByKey - -This is a somewhat forced example of `GroupByKey` to count words as the previous example did, but without using `beam.combiners.Count.PerElement`. As shown in the example, you can use a wildcard to specify the text file source. 
-```python -import re -import apache_beam as beam -p = beam.Pipeline('DirectRunner') -class MyCountTransform(beam.PTransform): - def expand(self, pcoll): - return (pcoll - | 'one word' >> beam.Map(lambda word: (word, 1)) - # GroupByKey accepts a PCollection of (word, 1) elements and - # outputs a PCollection of (word, [1, 1, ...]) - | 'group words' >> beam.GroupByKey() - | 'count words' >> beam.Map(lambda (word, counts): (word, len(counts)))) - -(p - | 'read' >> beam.io.ReadFromText('./names*') - | 'split' >> beam.FlatMap(lambda x: re.findall(r'\w+', x)) - | MyCountTransform() - | 'write' >> beam.io.WriteToText('./word_count')) -p.run() -``` - -### Type hints - -In some cases, providing type hints can improve the efficiency -of the data encoding. - -```python -import apache_beam as beam -from apache_beam.typehints import typehints -p = beam.Pipeline('DirectRunner') -(p - | 'read' >> beam.io.ReadFromText('./names') - | 'add types' >> beam.Map(lambda x: (x, 1)).with_output_types(typehints.KV[str, int]) - | 'group words' >> beam.GroupByKey() - | 'save' >> beam.io.WriteToText('./typed_names')) -p.run() -``` - -### BigQuery - -This example reads weather data from a BigQuery table, calculates the number of tornadoes per month, and writes the results to a table you specify. - -```python -import apache_beam as beam -project = 'DESTINATION-PROJECT-ID' -input_table = 'clouddataflow-readonly:samples.weather_stations' -output_table = 'DESTINATION-DATASET.DESTINATION-TABLE' - -p = beam.Pipeline(argv=['--project', project]) -(p - | 'read' >> beam.Read(beam.io.BigQuerySource(input_table)) - | 'months with tornadoes' >> beam.FlatMap( - lambda row: [(int(row['month']), 1)] if row['tornado'] else []) - | 'monthly count' >> beam.CombinePerKey(sum) - | 'format' >> beam.Map(lambda (k, v): {'month': k, 'tornado_count': v}) - | 'save' >> beam.Write( - beam.io.BigQuerySink( - output_table, - schema='month:INTEGER, tornado_count:INTEGER', - create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, - write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))) -p.run() -``` - -This pipeline, like the one above, calculates the number of tornadoes per month, but it uses a query to filter the input instead of reading the whole table. - -```python -import apache_beam as beam -project = 'DESTINATION-PROJECT-ID' -output_table = 'DESTINATION-DATASET.DESTINATION-TABLE' -input_query = 'SELECT month, COUNT(month) AS tornado_count ' \ - 'FROM [clouddataflow-readonly:samples.weather_stations] ' \ - 'WHERE tornado=true GROUP BY month' -p = beam.Pipeline(argv=['--project', project]) -(p - | 'read' >> beam.Read(beam.io.BigQuerySource(query=input_query)) - | 'save' >> beam.Write(beam.io.BigQuerySink( - output_table, - schema='month:INTEGER, tornado_count:INTEGER', - create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED, - write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))) -p.run() -``` - -### Combiner examples - -Combiner transforms use "reducing" functions, such as sum, min, or max, to combine multiple values of a `PCollection` into a single value. - -```python -import apache_beam as beam -p = beam.Pipeline('DirectRunner') - -SAMPLE_DATA = [('a', 1), ('b', 10), ('a', 2), ('a', 3), ('b', 20)] - -(p - | beam.Create(SAMPLE_DATA) - | beam.CombinePerKey(sum) - | beam.io.WriteToText('./sums')) -p.run() -``` - -The [combiners_test.py](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/combiners_test.py) file contains more combiner examples. 
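A whole-`PCollection` reduction (no keys involved) works the same way with a global combine. Here is a minimal sketch in the style of the examples above; the `./max` output path is just illustrative:

```python
import apache_beam as beam
p = beam.Pipeline('DirectRunner')

# Reduce the entire PCollection to a single element: max([1, 10, 2, 3, 20]) == 20.
(p
 | beam.Create([1, 10, 2, 3, 20])
 | beam.CombineGlobally(max)
 | beam.io.WriteToText('./max'))
p.run()
```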
- -## Organizing Your Code - -Many projects will grow to multiple source code files. It is recommended that you organize your project so that all code involved in running your pipeline can be built as a Python package. This way, the package can easily be installed in the VM workers executing the job. - -Follow the [Juliaset example](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset). If the code is organized in this fashion, you can use the `--setup_file` command line option to create a source distribution out of the project files, stage the resulting tarball, and later install it in the workers executing the job. - -## More Information - -Please report any issues on [JIRA](https://issues.apache.org/jira/browse/BEAM/component/12328910). - -If you're interested in contributing to the Beam SDK, start by reading the [Contribute](http://beam.apache.org/contribute/) guide. http://git-wip-us.apache.org/repos/asf/beam/blob/672a2c40/sdks/python/apache_beam/examples/complete/game/README.md ---------------------------------------------------------------------- diff --git a/sdks/python/apache_beam/examples/complete/game/README.md b/sdks/python/apache_beam/examples/complete/game/README.md deleted file mode 100644 index 39677e4..0000000 --- a/sdks/python/apache_beam/examples/complete/game/README.md +++ /dev/null @@ -1,69 +0,0 @@ -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> -# 'Gaming' examples - -This directory holds a series of example Apache Beam pipelines in a simple 'mobile -gaming' domain. Each pipeline successively introduces new concepts. - -In the gaming scenario, many users play, as members of different teams, over -the course of a day, and their actions are logged for processing. Some of the -logged game events may be late-arriving, if users play on mobile devices and go -transiently offline for a period. - -The scenario includes not only "regular" users, but "robot users", which have a -higher click rate than the regular users, and may move from team to team. - -The first two pipelines in the series use pre-generated batch data samples. - -All of these pipelines write their results to Google BigQuery table(s). - -## The pipelines in the 'gaming' series - -### user_score - -The first pipeline in the series is `user_score`. This pipeline does batch -processing of data collected from gaming events. It calculates the sum of -scores per user, over an entire batch of gaming data (collected, say, for each -day). The batch processing will not include any late data that arrives after -the day's cutoff point. - -### hourly_team_score - -The next pipeline in the series is `hourly_team_score`. This pipeline also -processes data collected from gaming events in batch. 
It builds on `user_score`, -but uses [fixed windows](https://beam.apache.org/documentation/programming-guide/#windowing), -by default an hour in duration. It calculates the sum of scores per team, for -each window, optionally allowing specification of two timestamps before and -after which data is filtered out. This allows a model where late data collected -after the intended analysis window can be included in the analysis, and any -late-arriving data prior to the beginning of the analysis window can be removed -as well. - -By using windowing and adding element timestamps, we can do finer-grained -analysis than with the `user_score` pipeline; we're now tracking scores for -each hour rather than over the course of a whole day. However, our batch -processing is high-latency, in that we don't get results from plays at the -beginning of the batch's time period until the complete batch is processed. - -## Viewing the results in BigQuery - -All of the pipelines write their results to BigQuery. `user_score` and -`hourly_team_score` each write one table. The pipelines have default table names -that you can override when you start up the pipeline if those tables already -exist.
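As a rough illustration of the fixed-windowing pattern that `hourly_team_score` (and its Java counterpart `HourlyTeamScore`) describes, the core of such a pipeline can be sketched as follows; the `(team, score, event_time)` element shape and the function name are assumptions for this sketch, not the example's actual code:

```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

def hourly_team_sums(events):
    """events: a PCollection of (team, score, event_time_in_seconds) tuples."""
    return (events
            # Stamp each element with its event time, so windowing reflects
            # when the play happened rather than when it is processed.
            | 'add timestamps' >> beam.Map(
                lambda e: TimestampedValue((e[0], e[1]), e[2]))
            # Slice the data into fixed one-hour event-time windows.
            | 'hourly windows' >> beam.WindowInto(FixedWindows(60 * 60))
            # Sum scores per team independently within each window.
            | 'sum per team' >> beam.CombinePerKey(sum))
```

The optional timestamp bounds described above would simply be `beam.Filter` steps applied before the windowing.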
