[
https://issues.apache.org/jira/browse/BEAM-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297653#comment-16297653
]
ASF GitHub Bot commented on BEAM-3374:
--------------------------------------
asfgit closed pull request #368: [BEAM-3374] Fix missing/stretched images, improve alt text
URL: https://github.com/apache/beam-site/pull/368
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/src/contribute/contribution-guide.md b/src/contribute/contribution-guide.md
index bb6a589e3..5a7f0b9c6 100644
--- a/src/contribute/contribution-guide.md
+++ b/src/contribute/contribution-guide.md
@@ -17,7 +17,8 @@ or participate on the documentation effort.
We use a review-then-commit workflow in Beam for all contributions.
-
+
**For larger contributions or those that affect multiple components:**
diff --git a/src/contribute/maturity-model.md b/src/contribute/maturity-model.md
index 60593a49b..429b04abf 100644
--- a/src/contribute/maturity-model.md
+++ b/src/contribute/maturity-model.md
@@ -258,7 +258,8 @@ While the majority of commits is still provided by a single organization, it is
Finally, the contributor diversity has increased significantly. Over each of
the last three months, no organization has had more than ~50% of the unique
contributors per month. (Assumptions: commits to master branch of the main
repository, excludes merge commits, best effort to identify unique
contributors).
-
+
## Dependency analysis
This section analyses project's direct and transitive dependencies to ensure
compliance with Apache Software Foundation's policies and guidelines.
diff --git a/src/documentation/execution-model.md b/src/documentation/execution-model.md
index 4f839cad3..11b53b8e4 100644
--- a/src/documentation/execution-model.md
+++ b/src/documentation/execution-model.md
@@ -11,9 +11,6 @@ The Beam model allows runners to execute your pipeline in different ways. You
may observe various effects as a result of the runner’s choices. This page
describes these effects so you can better understand how Beam pipelines
execute.
-* toc
-{:toc}
-
## Processing of elements
The serialization and communication of elements between machines is one of the
@@ -78,27 +75,28 @@ in parallel, and how transforms are retried when failures occur.
When executing a single `ParDo`, a runner might divide an example input
collection of nine elements into two bundles as shown in figure 1.
-
+
-**Figure 1:** A runner divides an input collection with nine elements
-into two bundles.
+*Figure 1: A runner divides an input collection into two bundles.*
When the `ParDo` executes, workers may process the two bundles in parallel as
shown in figure 2.
-
+
-**Figure 2:** Two workers process the two bundles in parallel. The elements in
-each bundle are processed in sequence.
+*Figure 2: Two workers process the two bundles in parallel.*
Since elements cannot be split, the maximum parallelism for a transform depends
-on the number of elements in the collection. In our example, the input
-collection has nine elements, so the maximum parallelism is nine.
+on the number of elements in the collection. In figure 3, the input collection
+has nine elements, so the maximum parallelism is nine.
-
+
-**Figure 3:** The maximum parallelism is nine, as there are nine elements in
-the input collection.
+*Figure 3: Nine workers process a nine element input collection in parallel.*
Note: Splittable ParDo allows splitting the processing of a single input across
multiple bundles. This feature is a work in progress.
@@ -111,9 +109,11 @@ output elements without altering the bundling. In figure 4, `ParDo1` and
`ParDo2` are _dependently parallel_ if the output of `ParDo1` for a given
element must be processed on the same worker.
-
+
-**Figure 4:** Two transforms in sequence and their corresponding input
-collections.
+*Figure 4: Two transforms in sequence and their corresponding input
+collections.*
Figure 5 shows how these dependently parallel transforms might execute. The
first worker executes `ParDo1` on the elements in bundle A (which results in
@@ -121,9 +121,11 @@ bundle C), and then executes `ParDo2` on the elements in bundle C. Similarly,
the second worker executes `ParDo1` on the elements in bundle B (which results
in bundle D), and then executes `ParDo2` on the elements in bundle D.
-
+
-**Figure 5:** Two workers execute dependently parallel ParDo transforms.
+*Figure 5: Two workers execute dependently parallel ParDo transforms.*
Executing transforms this way allows a runner to avoid redistributing elements
between workers, which saves on communication costs. However, the maximum
parallelism
@@ -147,12 +149,14 @@ there is one element still awaiting processing.
We see that the runner retries all elements in bundle B and the processing
completes successfully the second time. Note that the retry does not necessarily
happen on the same worker as the original processing attempt, as shown in the
-diagram.
+figure.
-
+
-**Figure 6:** The processing of an element within bundle B fails, and another
-worker retries the entire bundle.
+*Figure 6: The processing of an element within bundle B fails, and another
+worker retries the entire bundle.*
Because we encountered a failure while processing an element in the input
bundle, we had to reprocess _all_ of the elements in the input bundle. This means
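To make the bundle-retry behavior described above concrete, here is a minimal plain-Java sketch. This is not Beam runner code and the names are hypothetical; it only illustrates the semantics: a failure on any element discards the bundle's partial output, and every element in the bundle is reprocessed, possibly elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Conceptual sketch of bundle-level retry (not a Beam runner API):
// if any element in a bundle fails, the partial output is discarded
// and every element in the bundle is reprocessed.
public class BundleRetry {
  public static <T, R> List<R> processBundle(
      List<T> bundle, Function<T, R> fn, int maxAttempts) {
    RuntimeException last = null;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      List<R> output = new ArrayList<>();
      try {
        for (T element : bundle) {
          output.add(fn.apply(element)); // one failure poisons the whole bundle
        }
        return output; // the bundle's output commits only as a unit
      } catch (RuntimeException e) {
        last = e; // drop partial output and retry all elements
      }
    }
    throw last;
  }
}
```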
@@ -176,10 +180,12 @@ the output of `ParDo2`. Because the runner was executing `ParDo1` and `ParDo2`
together, the output bundle from `ParDo1` must also be thrown away, and all
elements in the input bundle must be retried. These two `ParDo`s are
co-failing.
-
+
-**Figure 7:** Processing of an element within bundle D fails, so all elements
-in the input bundle are retried.
+*Figure 7: Processing of an element within bundle D fails, so all elements in
+the input bundle are retried.*
Note that the retry does not necessarily have the same processing time as the
original attempt, as shown in the diagram.
diff --git a/src/documentation/pipelines/design-your-pipeline.md b/src/documentation/pipelines/design-your-pipeline.md
index 87250afe1..52ae34134 100644
--- a/src/documentation/pipelines/design-your-pipeline.md
+++ b/src/documentation/pipelines/design-your-pipeline.md
@@ -24,13 +24,14 @@ When designing your Beam pipeline, consider a few basic questions:
## A basic pipeline
-The simplest pipelines represent a linear flow of operations, as shown in
-Figure 1 below:
+The simplest pipelines represent a linear flow of operations, as shown in
+figure 1.
-<figure id="fig1">
- <img src="{{ site.baseurl }}/images/design-your-pipeline-linear.png"
- alt="A linear pipeline.">
-</figure>
-Figure 1: A linear pipeline.
+
+
+*Figure 1: A linear pipeline.*
However, your pipeline can be significantly more complex. A pipeline
represents a [Directed Acyclic
Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) of steps. It can
have multiple input sources, multiple output sinks, and its operations
(`PTransform`s) can both read and output multiple `PCollection`s. The following
examples show some of the different shapes your pipeline can take.
@@ -42,13 +43,17 @@ It's important to understand that transforms do not consume `PCollection`s; inst
You can use the same `PCollection` as input for multiple transforms without
consuming the input or altering it.
-The pipeline illustrated in Figure 2 below reads its input, first names
(Strings), from a single source, a database table, and creates a `PCollection`
of table rows. Then, the pipeline applies multiple transforms to the **same**
`PCollection`. Transform A extracts all the names in that `PCollection` that
start with the letter 'A', and Transform B extracts all the names in that
`PCollection` that start with the letter 'B'. Both transforms A and B have the
same input `PCollection`.
+The pipeline in figure 2 is a branching pipeline. The pipeline reads its input
(first names represented as strings) from a database table and creates a
`PCollection` of table rows. Then, the pipeline applies multiple transforms to
the **same** `PCollection`. Transform A extracts all the names in that
`PCollection` that start with the letter 'A', and Transform B extracts all the
names in that `PCollection` that start with the letter 'B'. Both transforms A
and B have the same input `PCollection`.
+
+
+
+*Figure 2: A branching pipeline. Two transforms are applied to a single
+PCollection of database table rows.*
+
+The following example code applies two transforms to a single input collection.
-<figure id="fig2">
- <img src="{{ site.baseurl
}}/images/design-your-pipeline-multiple-pcollections.png"
- alt="A pipeline with multiple transforms. Note that the PCollection
of table rows is processed by two transforms.">
-</figure>
-Figure 2: A pipeline with multiple transforms. Note that the PCollection of
the database table rows is processed by two transforms. See the example code
below:
```java
PCollection<String> dbRowCollection = ...;
@@ -75,15 +80,17 @@ PCollection<String> bCollection = dbRowCollection.apply("bTrans", ParDo.of(new D
Another way to branch a pipeline is to have a **single** transform output to
multiple `PCollection`s by using [tagged outputs]({{ site.baseurl
}}/documentation/programming-guide/#additional-outputs). Transforms that
produce more than one output process each element of the input once, and output
to zero or more `PCollection`s.
-Figure 3 below illustrates the same example described above, but with one
transform that produces multiple outputs. Names that start with 'A' are added
to the main output `PCollection`, and names that start with 'B' are added to an
additional output `PCollection`.
+Figure 3 illustrates the same example described above, but with one transform
that produces multiple outputs. Names that start with 'A' are added to the main
output `PCollection`, and names that start with 'B' are added to an additional
output `PCollection`.
-<figure id="fig3">
- <img src="{{ site.baseurl
}}/images/design-your-pipeline-additional-outputs.png"
- alt="A pipeline with a transform that outputs multiple PCollections.">
-</figure>
-Figure 3: A pipeline with a transform that outputs multiple PCollections.
+
-The pipeline in Figure 2 contains two transforms that process the elements in
the same input `PCollection`. One transform uses the following logic:
+*Figure 3: A pipeline with a transform that outputs multiple PCollections.*
+
+If we compare the pipelines in figure 2 and figure 3, you can see they perform
+the same operation in different ways. The pipeline in figure 2 contains two
+transforms that process the elements in the same input `PCollection`. One
+transform uses the following logic:
<pre>if (starts with 'A') { outputToPCollectionA }</pre>
@@ -93,11 +100,15 @@ while the other transform uses:
Because each transform reads the entire input `PCollection`, each element in
the input `PCollection` is processed twice.
-The pipeline in Figure 3 performs the same operation in a different way - with
only one transform that uses the following logic:
+The pipeline in figure 3 performs the same operation in a different way - with
only one transform that uses the following logic:
<pre>if (starts with 'A') { outputToPCollectionA } else if (starts with 'B') {
outputToPCollectionB }</pre>
-where each element in the input `PCollection` is processed once. See the
example code below:
+where each element in the input `PCollection` is processed once.
+
+The following example code applies one transform that processes each element
+once and outputs two collections.
+
```java
// Define two TupleTags, one for each output.
final TupleTag<String> startsWithATag = new TupleTag<String>(){};
@@ -139,32 +150,43 @@ Often, after you've branched your `PCollection` into multiple `PCollection`s via
* **Flatten** - You can use the `Flatten` transform in the Beam SDKs to
merge multiple `PCollection`s of the **same type**.
* **Join** - You can use the `CoGroupByKey` transform in the Beam SDK to
perform a relational join between two `PCollection`s. The `PCollection`s must
be keyed (i.e. they must be collections of key/value pairs) and they must use
the same key type.
-The example depicted in Figure 4 below is a continuation of the example
illustrated in Figure 2 in [the section
above](#multiple-transforms-process-the-same-pcollection). After branching into
two `PCollection`s, one with names that begin with 'A' and one with names that
begin with 'B', the pipeline merges the two together into a single
`PCollection` that now contains all names that begin with either 'A' or 'B'.
Here, it makes sense to use `Flatten` because the `PCollection`s being merged
both contain the same type.
+The example in figure 4 is a continuation of the example in figure 2 in [the
+section above](#multiple-transforms-process-the-same-pcollection). After
+branching into two `PCollection`s, one with names that begin with 'A' and one
+with names that begin with 'B', the pipeline merges the two together into a
+single `PCollection` that now contains all names that begin with either 'A' or
+'B'. Here, it makes sense to use `Flatten` because the `PCollection`s being
+merged both contain the same type.
+
+
+
+*Figure 4: A pipeline that merges two collections into one collection with the
+Flatten transform.*
+
+The following example code applies `Flatten` to merge two collections.
-<figure id="fig4">
- <img src="{{ site.baseurl }}/images/design-your-pipeline-flatten.png"
- alt="Part of a pipeline that merges multiple PCollections.">
-</figure>
-Figure 4: Part of a pipeline that merges multiple PCollections. See the
example code below:
```java
//merge the two PCollections with Flatten
PCollectionList<String> collectionList =
    PCollectionList.of(aCollection).and(bCollection);
PCollection<String> mergedCollectionWithFlatten = collectionList
    .apply(Flatten.<String>pCollections());
-// continue with the new merged PCollection
+// continue with the new merged PCollection
mergedCollectionWithFlatten.apply(...);
```
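As a conceptual aside, `Flatten`'s merge semantics amount to concatenating same-typed collections; no keys or ordering guarantees are involved. A plain-Java sketch of that behavior (not the Beam API; names hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of Flatten's merge semantics (not the Beam API):
// merging same-typed collections is simple concatenation.
public class FlattenSketch {
  public static <T> List<T> flatten(List<List<T>> collections) {
    List<T> merged = new ArrayList<>();
    for (List<T> c : collections) {
      merged.addAll(c); // every element of every input appears in the output
    }
    return merged;
  }
}
```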
## Multiple sources
-Your pipeline can read its input from one or more sources. If your pipeline
reads from multiple sources and the data from those sources is related, it can
be useful to join the inputs together. In the example illustrated in Figure 5
below, the pipeline reads names and addresses from a database table, and names
and order numbers from a Kafka topic. The pipeline then uses `CoGroupByKey` to
join this information, where the key is the name; the resulting `PCollection`
contains all the combinations of names, addresses, and orders.
+Your pipeline can read its input from one or more sources. If your pipeline
reads from multiple sources and the data from those sources is related, it can
be useful to join the inputs together. In the example illustrated in figure 5
below, the pipeline reads names and addresses from a database table, and names
and order numbers from a Kafka topic. The pipeline then uses `CoGroupByKey` to
join this information, where the key is the name; the resulting `PCollection`
contains all the combinations of names, addresses, and orders.
+
+
+
+*Figure 5: A pipeline that does a relational join of two input collections.*
+
+The following example code applies `Join` to join two input collections.
-<figure id="fig5">
- <img src="{{ site.baseurl }}/images/design-your-pipeline-join.png"
- alt="A pipeline with multiple input sources.">
-</figure>
-Figure 5: A pipeline with multiple input sources. See the example code below:
```java
PCollection<KV<String, String>> userAddress =
    pipeline.apply(JdbcIO.<KV<String, String>>read()...);
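The hunk above truncates the Beam join example, so the join semantics it describes can be sketched in plain Java instead. This is not the Beam `CoGroupByKey` API, and the names and values are hypothetical; it only shows the grouping: each key seen in either input is paired with its value lists from both sides, with an empty list for a side that has no values for that key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Conceptual sketch of a CoGroupByKey-style join (not the Beam API).
public class CoGroupSketch {
  public static Map<String, List<List<String>>> coGroupByKey(
      Map<String, List<String>> addressesByName,
      Map<String, List<String>> ordersByName) {
    Map<String, List<List<String>>> joined = new TreeMap<>();
    TreeSet<String> names = new TreeSet<>(addressesByName.keySet());
    names.addAll(ordersByName.keySet());
    for (String name : names) {
      joined.put(name, List.of(
          addressesByName.getOrDefault(name, List.of()),
          ordersByName.getOrDefault(name, List.of())));
    }
    return joined;
  }
}
```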
diff --git a/src/documentation/programming-guide.md b/src/documentation/programming-guide.md
index f746d7dc2..a817d99be 100644
--- a/src/documentation/programming-guide.md
+++ b/src/documentation/programming-guide.md
@@ -467,9 +467,13 @@ you can chain transforms to create a sequential pipeline, like this one:
| [Third Transform])
```
-The resulting workflow graph of the above pipeline looks like this:
+The resulting workflow graph of the above pipeline looks like this.
-[Sequential Graph Graphic]
+
+
+*Figure: A linear pipeline with three sequential transforms.*
However, note that a transform *does not consume or otherwise alter* the input
collection--remember that a `PCollection` is immutable by definition. This means
@@ -485,9 +489,14 @@ a branching pipeline, like so:
[Output PCollection 2] = [Input PCollection] | [Transform 2]
```
-The resulting workflow graph from the branching pipeline above looks like this:
+The resulting workflow graph from the branching pipeline above looks like this.
+
+
-[Branching Graph Graphic]
+*Figure: A branching pipeline. Two transforms are applied to a single
+PCollection of database table rows.*
You can also build your own [composite transforms](#composite-transforms) that
nest multiple sub-steps inside a single, larger transform. Composite transforms
diff --git a/src/get-started/beam-overview.md b/src/get-started/beam-overview.md
index e3a474a6d..3fe733ed7 100644
--- a/src/get-started/beam-overview.md
+++ b/src/get-started/beam-overview.md
@@ -20,10 +20,8 @@ The Beam SDKs provide a unified programming model that can represent and transfo
Beam currently supports the following language-specific SDKs:
-* Java <img src="{{ site.baseurl }}/images/logos/sdks/java.png"
- alt="Java SDK">
-* Python <img src="{{ site.baseurl }}/images/logos/sdks/python.png"
- alt="Python SDK ">
+* Java 
+* Python 
## Apache Beam Pipeline Runners
@@ -31,16 +29,11 @@ The Beam Pipeline Runners translate the data processing pipeline you define with
Beam currently supports Runners that work with the following distributed
processing back-ends:
-* Apache Apex <img src="{{ site.baseurl }}/images/logos/runners/apex.png"
- alt="Apache Apex">
-* Apache Flink <img src="{{ site.baseurl }}/images/logos/runners/flink.png"
- alt="Apache Flink">
-* Apache Gearpump (incubating) <img src="{{ site.baseurl
}}/images/logos/runners/gearpump.png"
- alt="Apache Gearpump">
-* Apache Spark <img src="{{ site.baseurl }}/images/logos/runners/spark.png"
- alt="Apache Spark">
-* Google Cloud Dataflow <img src="{{ site.baseurl
}}/images/logos/runners/dataflow.png"
- alt="Google Cloud Dataflow">
+* Apache Apex 
+* Apache Flink 
+* Apache Gearpump (incubating) 
+* Apache Spark 
+* Google Cloud Dataflow 
**Note:** You can always execute your pipeline locally for testing and
debugging purposes.
diff --git a/src/get-started/mobile-gaming-example.md b/src/get-started/mobile-gaming-example.md
index bcc16b32d..9a734c050 100644
--- a/src/get-started/mobile-gaming-example.md
+++ b/src/get-started/mobile-gaming-example.md
@@ -38,12 +38,14 @@ When the user completes an instance of the game, their phone sends the data even
The following diagram shows the ideal situation (events are processed as they
occur) vs. reality (there is often a time delay before processing).
-<figure id="fig1">
- <img src="{{ site.baseurl }}/images/gaming-example-basic.png"
- width="264" height="260"
- alt="Score data for three users.">
-</figure>
-**Figure 1:** The X-axis represents event time: the actual time a game event
occurred. The Y-axis represents processing time: the time at which a game event
was processed. Ideally, events should be processed as they occur, depicted by
the dotted line in the diagram. However, in reality that is not the case and it
looks more like what is depicted by the red squiggly line.
+
+
+*Figure 1: The X-axis represents event time: the actual time a game event
+occurred. The Y-axis represents processing time: the time at which a game event
+was processed. Ideally, events should be processed as they occur, depicted by
+the dotted line in the diagram. However, in reality that is not the case and it
+looks more like what is depicted by the red squiggly line above the ideal
+line.*
The data events might be received by the game server significantly later than
users generate them. This time difference (called **skew**) can have processing
implications for pipelines that make calculations that consider when each score
was generated. Such pipelines might track scores generated during each hour of
a day, for example, or they calculate the length of time that users are
continuously playing the game—both of which depend on each data record's event
time.
@@ -79,14 +81,12 @@ As the pipeline processes each event, the event score gets added to the sum tota
2. Sum the score values for each unique user by grouping each game event by
user ID and combining the score values to get the total score for that
particular user.
3. Write the result data to a text file.
-The following diagram shows score data for several users over the pipeline
analysis period. In the diagram, each data point is an event that results in
one user/score pair:
+The following diagram shows score data for several users over the pipeline
analysis period. In the diagram, each data point is an event that results in
one user/score pair.
+
+{: width="850px"}
-<figure id="fig2">
- <img src="{{ site.baseurl }}/images/gaming-example.gif"
- width="900" height="263"
- alt="Score data for three users.">
-</figure>
-**Figure 2:** Score data for three users.
+*Figure 2: Score data for three users.*
This example uses batch processing, and the diagram's Y axis represents
processing time: the pipeline processes events lower on the Y-axis first, and
events higher up the axis later. The diagram's X axis represents the event time
for each game event, as denoted by that event's timestamp. Note that the
individual events in the diagram are not processed by the pipeline in the same
order as they occurred (according to their timestamps).
@@ -152,12 +152,11 @@ Using fixed-time windowing lets the pipeline provide better information on how e
The following diagram shows how the pipeline processes a day's worth of a
single team's scoring data after applying fixed-time windowing:
-<figure id="fig3">
- <img src="{{ site.baseurl }}/images/gaming-example-team-scores-narrow.gif"
- width="900" height="390"
- alt="Score data for two teams.">
-</figure>
-**Figure 3:** Score data for two teams. Each team's scores are divided into
logical windows based on when those scores occurred in event time.
+{: width="800px"}
+
+*Figure 3: Score data for two teams. Each team's scores are divided into
+logical windows based on when those scores occurred in event time.*
Notice that as processing time advances, the sums are now _per window_; each
window represents an hour of _event time_ during the day in which the scores
occurred.
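The per-window sums described above can be sketched in plain Java. This is not Beam's windowing API; a one-hour fixed window is assumed to match the example, and each event is a hypothetical (eventTimeMillis, score) pair. Each score lands in the window containing its event time, and sums are kept per window rather than for the whole dataset.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of fixed-time windowing (not Beam's Window API).
public class FixedWindowSums {
  static final long HOUR_MILLIS = 3_600_000L;

  // events are (eventTimeMillis, score) pairs
  public static Map<Long, Long> sumPerHourWindow(List<long[]> events) {
    Map<Long, Long> sums = new TreeMap<>();
    for (long[] e : events) {
      long windowStart = e[0] - (e[0] % HOUR_MILLIS); // floor to window boundary
      sums.merge(windowStart, e[1], Long::sum);
    }
    return sums;
  }
}
```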
@@ -250,12 +249,12 @@ Because we want all the data that has arrived in the pipeline every time we upda
When we specify a ten-minute processing time trigger for the single global
window, the pipeline effectively takes a "snapshot" of the contents of the
window every time the trigger fires. This snapshot happens after ten minutes
have passed since data was received. If no data has arrived, the pipeline takes
its next "snapshot" 10 minutes after an element arrives. Since we're using a
single global window, each snapshot contains all the data collected _to that
point in time_. The following diagram shows the effects of using a processing
time trigger on the single global window:
-<figure id="fig4">
- <img src="{{ site.baseurl }}/images/gaming-example-proc-time-narrow.gif"
- width="900" height="263"
- alt="Score data for three users.">
-</figure>
-**Figure 4:** Score data for three users. Each user's scores are grouped
together in a single global window, with a trigger that generates a snapshot
for output ten minutes after data is received.
+{: width="850px"}
+
+*Figure 4: Score data for three users. Each user's scores are grouped together
+in a single global window, with a trigger that generates a snapshot for output
+ten minutes after data is received.*
As processing time advances and more scores are processed, the trigger outputs
the updated sum for each user.
@@ -282,12 +281,12 @@ In an ideal world, all data would be processed immediately when it occurs, so th
The following diagram shows the relationship between ongoing processing time
and each score's event time for two teams:
-<figure id="fig5">
- <img src="{{ site.baseurl }}/images/gaming-example-event-time-narrow.gif"
- width="900" height="390"
- alt="Score data by team, windowed by event time.">
-</figure>
-**Figure 5:** Score data by team, windowed by event time. A trigger based on
processing time causes the window to emit speculative early results and include
late results.
+{: width="800px"}
+
+*Figure 5: Score data by team, windowed by event time. A trigger based on
+processing time causes the window to emit speculative early results and include
+late results.*
The dotted line in the diagram is the "ideal" **watermark**: Beam's notion of
when all data in a given window can reasonably be considered to have arrived.
The irregular solid line represents the actual watermark, as determined by the
data source.
@@ -359,13 +358,12 @@ When you set session windowing, you specify a _minimum gap duration_ between eve
The following diagram shows how data might look when grouped into session
windows. Unlike fixed windows, session windows are _different for each user_
and is dependent on each individual user's play pattern:
-<figure id="fig6">
- <img src="{{ site.baseurl }}/images/gaming-example-session-windows.png"
- width="662" height="521"
- alt="A diagram representing session windowing."
- alt="User sessions, with a minimum gap duration.">
-</figure>
-**Figure 6:** User sessions, with a minimum gap duration. Note how each user
has different sessions, according to how many instances they play and how long
their breaks between instances are.
+
+
+*Figure 6: User sessions with a minimum gap duration. Each user has different
+sessions, according to how many instances they play and how long their breaks
+between instances are.*
We can use the session-windowed data to determine the average length of
uninterrupted play time for all of our users, as well as the total score they
achieve during each session. We can do this in the code by first applying
session windows, summing the score per user and session, and then using a
transform to calculate the length of each individual session:
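The session-grouping step described above can be sketched in plain Java. This is not Beam's `Sessions` API and the names are hypothetical; it only shows the semantics: scanning one user's event times in order, a new session starts whenever the gap since the previous event exceeds the minimum gap duration.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of session windowing (not Beam's Sessions API).
public class SessionSketch {
  public static List<List<Long>> sessions(List<Long> sortedEventTimes,
                                          long minGapMillis) {
    List<List<Long>> result = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    Long previous = null;
    for (long t : sortedEventTimes) {
      if (previous != null && t - previous > minGapMillis) {
        result.add(current);          // gap too large: close the session
        current = new ArrayList<>();
      }
      current.add(t);
      previous = t;
    }
    if (!current.isEmpty()) {
      result.add(current);
    }
    return result;
  }
}
```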
diff --git a/src/get-started/wordcount-example.md b/src/get-started/wordcount-example.md
index 82d64b03b..408ce5b71 100644
--- a/src/get-started/wordcount-example.md
+++ b/src/get-started/wordcount-example.md
@@ -26,21 +26,21 @@ four successively more detailed WordCount examples that build on each other. The
input text for all the examples is a set of Shakespeare's texts.
Each WordCount example introduces different concepts in the Beam programming
-model. Begin by understanding Minimal WordCount, the simplest of the examples.
+model. Begin by understanding MinimalWordCount, the simplest of the examples.
Once you feel comfortable with the basic principles in building a pipeline,
continue on to learn more concepts in the other examples.
-* **Minimal WordCount** demonstrates the basic principles involved in building a
+* **MinimalWordCount** demonstrates the basic principles involved in building a
pipeline.
* **WordCount** introduces some of the more common best practices in creating
re-usable and maintainable pipelines.
-* **Debugging WordCount** introduces logging and debugging practices.
-* **Windowed WordCount** demonstrates how you can use Beam's programming model
+* **DebuggingWordCount** introduces logging and debugging practices.
+* **WindowedWordCount** demonstrates how you can use Beam's programming model
to handle both bounded and unbounded datasets.
## MinimalWordCount example
-Minimal WordCount demonstrates a simple pipeline that can read from a text file,
+MinimalWordCount demonstrates a simple pipeline that can read from a text file,
apply transforms to tokenize and count the words, and write the data to an
output text file. This example hard-codes the locations for its input and output
files and doesn't perform any error checking; it is intended to only show you
@@ -110,7 +110,7 @@ To view the full code in Python, see
* Running the Pipeline
The following sections explain these concepts in detail, using the relevant code
-excerpts from the Minimal WordCount pipeline.
+excerpts from the MinimalWordCount pipeline.
### Creating the pipeline
@@ -160,7 +160,7 @@ Pipeline p = Pipeline.create(options);
### Applying pipeline transforms
-The Minimal WordCount pipeline contains several transforms to read data into the
+The MinimalWordCount pipeline contains several transforms to read data into the
pipeline, manipulate or otherwise transform the data, and write out the results.
Transforms can consist of an individual operation, or can contain multiple
nested transforms (which is a [composite transform]({{ site.baseurl
}}/documentation/programming-guide#composite-transforms)).
@@ -170,10 +170,12 @@ input and output data is often represented by the SDK class `PCollection`.
`PCollection` is a special class, provided by the Beam SDK, that you can use to
represent a data set of virtually any size, including unbounded data sets.
-<img src="{{ "/images/wordcount-pipeline.png" | prepend: site.baseurl }}"
alt="Word Count pipeline diagram">
-Figure 1: The pipeline data flow.
+{: width="800px"}
-The Minimal WordCount pipeline contains five transforms:
+*Figure 1: The MinimalWordCount pipeline data flow.*
+
+The MinimalWordCount pipeline contains five transforms:
1. A text file `Read` transform is applied to the `Pipeline` object itself,
and
produces a `PCollection` as output. Each element in the output
`PCollection`
@@ -298,7 +300,7 @@ your pipeline, and help make your pipeline's code reusable.
This section assumes that you have a good understanding of the basic concepts in
building a pipeline. If you feel that you aren't at that point yet, read the
-above section, [Minimal WordCount](#minimalwordcount-example).
+above section, [MinimalWordCount](#minimalwordcount-example).
**To run this example in Java:**
@@ -402,7 +404,7 @@ When using `ParDo` transforms, you need to specify the processing operation that
gets applied to each element in the input `PCollection`. This processing
operation is a subclass of the SDK class `DoFn`. You can create the `DoFn`
subclasses for each `ParDo` inline, as an anonymous inner class instance, as is
-done in the previous example (Minimal WordCount). However, it's often a good
+done in the previous example (MinimalWordCount). However, it's often a good
idea to define the `DoFn` at the global level, which makes it easier to unit
test and can make the `ParDo` code more readable.
@@ -502,9 +504,9 @@ public static void main(String[] args) {
{% github_sample
/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/snippets.py
tag:examples_wordcount_wordcount_options
%}```
-## Debugging WordCount example
+## DebuggingWordCount example
-The Debugging WordCount example demonstrates some best practices for
+The DebuggingWordCount example demonstrates some best practices for
instrumenting your pipeline code.
**To run this example in Java:**
@@ -710,7 +712,7 @@ public static void main(String[] args) {
## WindowedWordCount example
-This example, `WindowedWordCount`, counts words in text just as the previous
+The WindowedWordCount example counts words in text just as the previous
examples did, but introduces several advanced concepts.
**New Concepts:**
@@ -866,7 +868,7 @@ bounded sets of elements. PTransforms that aggregate multiple elements process
each `PCollection` as a succession of multiple, finite windows, even though the
entire collection itself may be of infinite size (unbounded).
-The `WindowedWordCount` example applies fixed-time windowing, wherein each
+The WindowedWordCount example applies fixed-time windowing, wherein each
window represents a fixed time interval. The fixed window size for this example
defaults to 1 minute (you can change this with a command-line option).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Images incorrectly rendering/missing
> ------------------------------------
>
> Key: BEAM-3374
> URL: https://issues.apache.org/jira/browse/BEAM-3374
> Project: Beam
> Issue Type: Bug
> Components: website
> Reporter: Melissa Pashniak
> Assignee: Melissa Pashniak
> Priority: Minor
>
> For example, mobile gaming has some stretched images, and a couple images in
> the programming guide are missing.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)