TheNeuralBit commented on a change in pull request #13565:
URL: https://github.com/apache/beam/pull/13565#discussion_r569559285
##########
File path: website/www/site/content/en/blog/adding-data-sources-to-sql.md
##########
@@ -81,7 +81,7 @@ The `TableProvider` classes are under
Our table provider looks like this:
-{{< highlight java >}}
+{{< highlight language="java" >}}
Review comment:
Is it not possible to make the original `highlight java` syntax work?
That would be preferable to changing the shortcodes everywhere.
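For context, Hugo's built-in `highlight` shortcode takes the language as a positional argument, so the original form should work out of the box (assuming the site uses the built-in shortcode rather than a custom override):

```
{{< highlight java >}}
public class MyTransform { ... }
{{< /highlight >}}
```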
##########
File path: website/www/site/content/en/contribute/practises.md
##########
@@ -0,0 +1,66 @@
+---
+title: "Beam community’s practises"
+layout: "arrow_template"
+---
+
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Beam community’s practises
Review comment:
nit: I think we should use the American English "practices" throughout
this page (and in the title)
##########
File path: website/www/site/content/en/contribute/ptransform-style-guide.md
##########
@@ -230,76 +234,76 @@ E.g. if you want to return a `PCollection<Foo>` and a
`PCollection<Bar>`, expose
For example:
-{{< highlight java >}}
+{{< highlight language="java" >}}
public class MyTransform extends PTransform<..., PCollectionTuple> {
- private final TupleTag<Moo> mooTag = new TupleTag<Moo>() {};
- private final TupleTag<Blah> blahTag = new TupleTag<Blah>() {};
- ...
- PCollectionTuple expand(... input) {
- ...
- PCollection<Moo> moo = ...;
- PCollection<Blah> blah = ...;
- return PCollectionTuple.of(mooTag, moo)
- .and(blahTag, blah);
- }
+private final TupleTag<Moo> mooTag = new TupleTag<Moo>() {};
+private final TupleTag<Blah> blahTag = new TupleTag<Blah>() {};
+...
+PCollectionTuple expand(... input) {
+...
+PCollection<Moo> moo = ...;
+PCollection<Blah> blah = ...;
+return PCollectionTuple.of(mooTag, moo)
+.and(blahTag, blah);
+}
- public TupleTag<Moo> getMooTag() {
- return mooTag;
- }
+public TupleTag<Moo> getMooTag() {
+return mooTag;
+}
- public TupleTag<Blah> getBlahTag() {
- return blahTag;
- }
- ...
Review comment:
It looks like there are still some cases where leading whitespace in
code snippets has been removed. Please revert these
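For reference, the left side of the diff keeps the method bodies indented; the reverted excerpt would read as follows (placeholders and truncation as in the original):

```java
public class MyTransform extends PTransform<..., PCollectionTuple> {
  private final TupleTag<Moo> mooTag = new TupleTag<Moo>() {};
  private final TupleTag<Blah> blahTag = new TupleTag<Blah>() {};
  ...
  PCollectionTuple expand(... input) {
    ...
    PCollection<Moo> moo = ...;
    PCollection<Blah> blah = ...;
    return PCollectionTuple.of(mooTag, moo)
        .and(blahTag, blah);
  }

  public TupleTag<Moo> getMooTag() {
    return mooTag;
  }

  public TupleTag<Blah> getBlahTag() {
    return blahTag;
  }
  ...
```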
##########
File path: website/www/site/content/en/contribute/ptransform-style-guide.md
##########
@@ -24,189 +25,192 @@ _A style guide for writers of new reusable PTransforms._
## Language-neutral considerations
### Consistency
+
Be consistent with prior art:
-* Please read the [contribution guide](/contribute/).
-* If there is already a similar transform in some SDK, make the API of your
transform similar, so that users' experience with one of them will transfer to
the other. This applies to transforms in the same-language SDK and
different-language SDKs.
-*Exception:* pre-existing transforms that clearly violate the current style
guide for the sole reason that they were developed before this guide was
ratified. In this case, the style guide takes priority over consistency with
the existing transform.
-* When there is no existing similar transform, stay within what is idiomatic
within your language of choice (e.g. Java or Python).
+- Please read the [contribution guide](/contribute/).
+- If there is already a similar transform in some SDK, make the API of your
transform similar, so that users' experience with one of them will transfer to
the other. This applies to transforms in the same-language SDK and
different-language SDKs.
+ _Exception:_ pre-existing transforms that clearly violate the current style
guide for the sole reason that they were developed before this guide was
ratified. In this case, the style guide takes priority over consistency with
the existing transform.
+- When there is no existing similar transform, stay within what is idiomatic
within your language of choice (e.g. Java or Python).
### Exposing a PTransform vs. something else
So you want to develop a library that people will use in their Beam pipelines
- a connector to a third-party system, a machine learning algorithm, etc. How
should you expose it?
Do:
-* Expose every major data-parallel task accomplished by your library as a
composite `PTransform`. This allows the structure of the transform to evolve
transparently to the code that uses it: e.g. something that started as a
`ParDo` can become a more complex transform over time.
-* Expose large, non-trivial, reusable sequential bits of the transform's code,
which others might want to reuse in ways you haven't anticipated, as a regular
function or class library. The transform should simply wire this logic
together. As a side benefit, you can unit-test those functions and classes
independently.
-*Example:* when developing a transform that parses files in a custom data
format, expose the format parser as a library; likewise for a transform that
implements a complex machine learning algorithm, etc.
-* In some cases, this may include Beam-specific classes, such as `CombineFn`,
or nontrivial `DoFn`s (those that are more than just a single `@ProcessElement`
method).
-As a rule of thumb: expose these if you anticipate that the full packaged
`PTransform` may be insufficient for a user's needs and the user may want to
reuse the lower-level primitive.
+- Expose every major data-parallel task accomplished by your library as a
composite `PTransform`. This allows the structure of the transform to evolve
transparently to the code that uses it: e.g. something that started as a
`ParDo` can become a more complex transform over time.
+- Expose large, non-trivial, reusable sequential bits of the transform's code,
which others might want to reuse in ways you haven't anticipated, as a regular
function or class library. The transform should simply wire this logic
together. As a side benefit, you can unit-test those functions and classes
independently.
+ _Example:_ when developing a transform that parses files in a custom data
format, expose the format parser as a library; likewise for a transform that
implements a complex machine learning algorithm, etc.
+- In some cases, this may include Beam-specific classes, such as `CombineFn`,
or nontrivial `DoFn`s (those that are more than just a single `@ProcessElement`
method).
+ As a rule of thumb: expose these if you anticipate that the full packaged
`PTransform` may be insufficient for a user's needs and the user may want to
reuse the lower-level primitive.
Do not:
-* Do not expose the exact way the transform is internally structured. E.g.:
the public API surface of your library *usually* (with exception of the last
bullet above) should not expose `DoFn`, concrete `Source` or `Sink` classes,
etc., in order to avoid presenting users with a confusing choice between
applying the `PTransform` or using the `DoFn`/`Source`/`Sink`.
+- Do not expose the exact way the transform is internally structured. E.g.:
the public API surface of your library _usually_ (with exception of the last
bullet above) should not expose `DoFn`, concrete `Source` or `Sink` classes,
etc., in order to avoid presenting users with a confusing choice between
applying the `PTransform` or using the `DoFn`/`Source`/`Sink`.
### Naming
Do:
-* Respect language-specific naming conventions, e.g. name classes in
`PascalCase` in Java and Python, functions in `camelCase` in Java but
`snake_case` in Python, etc.
-* Name factory functions so that either the function name is a verb, or
referring to the transform reads like a verb: e.g. `MongoDbIO.read()`,
`Flatten.iterables()`.
-* In typed languages, name `PTransform` classes also like verbs (e.g.:
`MongoDbIO.Read`, `Flatten.Iterables`).
-* Name families of transforms for interacting with a storage system using the
word "IO": `MongoDbIO`, `JdbcIO`.
+- Respect language-specific naming conventions, e.g. name classes in
`PascalCase` in Java and Python, functions in `camelCase` in Java but
`snake_case` in Python, etc.
+- Name factory functions so that either the function name is a verb, or
referring to the transform reads like a verb: e.g. `MongoDbIO.read()`,
`Flatten.iterables()`.
+- In typed languages, name `PTransform` classes also like verbs (e.g.:
`MongoDbIO.Read`, `Flatten.Iterables`).
+- Name families of transforms for interacting with a storage system using the
word "IO": `MongoDbIO`, `JdbcIO`.
Do not:
-* Do not use words `transform`, `source`, `sink`, `reader`, `writer`, `bound`,
`unbound` in `PTransform` class names (note: `bounded` and `unbounded` are fine
when referring to whether a `PCollection` is bounded or unbounded): these words
are redundant, confusing, obsolete, or name an existing different concept in
the SDK.
+- Do not use words `transform`, `source`, `sink`, `reader`, `writer`, `bound`,
`unbound` in `PTransform` class names (note: `bounded` and `unbounded` are fine
when referring to whether a `PCollection` is bounded or unbounded): these words
are redundant, confusing, obsolete, or name an existing different concept in
the SDK.
### Configuration
#### What goes into configuration vs. input collection
-* **Into input `PCollection`:** anything of which there may be a very large
number of instances (if there can be >1000 of it, it should be in a
`PCollection`), or which is potentially not known at pipeline construction time.
-E.g.: records to be processed or written to a third-party system; filenames to
be read.
-Exception: sometimes Beam APIs require things to be known at pipeline
construction time - e.g. the `Bounded`/`UnboundedSource` API. If you absolutely
have to use such an API, its input can of course go only into transform
configuration.
-* **Into transform configuration:** what is constant throughout the transform
(including `ValueProvider`s) and does not depend on the contents of the
transform's input `PCollection`s.
-E.g.: a database query or connection string; credentials; a user-specified
callback; a tuning parameter.
-One advantage of putting a parameter into transform configuration is, it can
be validated at pipeline construction time.
+- **Into input `PCollection`:** anything of which there may be a very large
number of instances (if there can be >1000 of it, it should be in a
`PCollection`), or which is potentially not known at pipeline construction time.
+ E.g.: records to be processed or written to a third-party system; filenames
to be read.
+ Exception: sometimes Beam APIs require things to be known at pipeline
construction time - e.g. the `Bounded`/`UnboundedSource` API. If you absolutely
have to use such an API, its input can of course go only into transform
configuration.
+- **Into transform configuration:** what is constant throughout the transform
(including `ValueProvider`s) and does not depend on the contents of the
transform's input `PCollection`s.
+ E.g.: a database query or connection string; credentials; a user-specified
callback; a tuning parameter.
+ One advantage of putting a parameter into transform configuration is, it can
be validated at pipeline construction time.
#### What parameters to expose
Do:
-* **Expose** parameters that are necessary to compute the output.
+- **Expose** parameters that are necessary to compute the output.
Do not:
-* **Do not expose** tuning knobs, such as batch sizes, connection pool sizes,
unless it's impossible to automatically supply or compute a good-enough value
(i.e., unless you can imagine a reasonable person reporting a bug about the
absence of this knob).
-* When developing a connector to a library that has many parameters, **do not
mirror each parameter** of the underlying library - if necessary, reuse the
underlying library's configuration class and let user supply a whole instance.
Example: `JdbcIO`.
-*Exception 1:* if some parameters of the underlying library interact with Beam
semantics non-trivially, then expose them. E.g. when developing a connector to
a pub/sub system that has a "delivery guarantee" parameter for publishers,
expose the parameter but prohibit values incompatible with the Beam model
(at-most-once and exactly-once).
-*Exception 2:* if the underlying library's configuration class is cumbersome
to use - e.g. does not declare a stable API, exposes problematic transitive
dependencies, or does not obey [semantic versioning](https://semver.org/) - in
this case, it is better to wrap it and expose a cleaner and more stable API to
users of the transform.
+- **Do not expose** tuning knobs, such as batch sizes, connection pool sizes,
unless it's impossible to automatically supply or compute a good-enough value
(i.e., unless you can imagine a reasonable person reporting a bug about the
absence of this knob).
+- When developing a connector to a library that has many parameters, **do not
mirror each parameter** of the underlying library - if necessary, reuse the
underlying library's configuration class and let user supply a whole instance.
Example: `JdbcIO`.
+ _Exception 1:_ if some parameters of the underlying library interact with
Beam semantics non-trivially, then expose them. E.g. when developing a
connector to a pub/sub system that has a "delivery guarantee" parameter for
publishers, expose the parameter but prohibit values incompatible with the Beam
model (at-most-once and exactly-once).
+ _Exception 2:_ if the underlying library's configuration class is cumbersome
to use - e.g. does not declare a stable API, exposes problematic transitive
dependencies, or does not obey [semantic versioning](https://semver.org/) - in
this case, it is better to wrap it and expose a cleaner and more stable API to
users of the transform.
### Error handling
#### Transform configuration errors
Detect errors early. Errors can be detected at the following stages:
-* (in a compiled language) compilation of the source code of a user's pipeline
-* constructing or setting up the transform
-* applying the transform in a pipeline
-* running the pipeline
+- (in a compiled language) compilation of the source code of a user's pipeline
+- constructing or setting up the transform
+- applying the transform in a pipeline
+- running the pipeline
For example:
-* In a typed language, take advantage of compile-time error checking by making
the API of the transform strongly-typed:
- * **Strongly-typed configuration:** e.g. in Java, a parameter that is a
URL should use the `URL` class, rather than the `String` class.
- * **Strongly-typed input and output:** e.g. a transform that writes to
Mongo DB should take a `PCollection<Document>` rather than
`PCollection<String>` (assuming it is possible to provide a `Coder` for
`Document`).
-* Detect invalid values of individual parameters in setter methods.
-* Detect invalid combinations of parameters in the transform's validate method.
+- In a typed language, take advantage of compile-time error checking by making
the API of the transform strongly-typed:
+ - **Strongly-typed configuration:** e.g. in Java, a parameter that is a URL
should use the `URL` class, rather than the `String` class.
+ - **Strongly-typed input and output:** e.g. a transform that writes to Mongo
DB should take a `PCollection<Document>` rather than `PCollection<String>`
(assuming it is possible to provide a `Coder` for `Document`).
+- Detect invalid values of individual parameters in setter methods.
+- Detect invalid combinations of parameters in the transform's validate method.
#### Runtime errors and data consistency
Favor data consistency above everything else. Do not mask data loss or
corruption. If data loss can't be prevented, fail.
Do:
-* In a `DoFn`, retry transient failures if the operation is likely to succeed
on retry. Perform such retries at the narrowest scope possible in order to
minimize the amount of retried work (i.e. ideally at the level of the RPC
library itself, or at the level of directly sending the failing RPC to a
third-party system). Otherwise, let the runner retry work at the appropriate
level of granularity for you (different runners may have different retry
behavior, but most of them do *some* retrying).
-* If the transform has side effects, strive to make them idempotent (i.e. safe
to apply multiple times). Due to retries, the side effects may be executed
multiple times, possibly in parallel.
-* If the transform can have unprocessable (permanently failing) records and
you want the pipeline to proceed despite that:
- * If bad records are safe to ignore, count the bad records in a metric.
Make sure the transform's documentation mentions this aggregator. Beware that
there is no programmatic access to reading the aggregator value from inside the
pipeline during execution.
- * If bad records may need manual inspection by the user, emit them into an
output that contains only those records.
- * Alternatively take a (default zero) threshold above which element
failures become bundle failures (structure the transform to count the total
number of elements and of failed elements, compare them and fail if failures
are above the threshold).
-* If the user requests a higher data consistency guarantee than you're able to
provide, fail. E.g.: if a user requests QoS 2 (exactly-once delivery) from an
MQTT connector, the connector should fail since Beam runners may retry writing
to the connector and hence exactly-once delivery can't be done.
+- In a `DoFn`, retry transient failures if the operation is likely to succeed
on retry. Perform such retries at the narrowest scope possible in order to
minimize the amount of retried work (i.e. ideally at the level of the RPC
library itself, or at the level of directly sending the failing RPC to a
third-party system). Otherwise, let the runner retry work at the appropriate
level of granularity for you (different runners may have different retry
behavior, but most of them do _some_ retrying).
+- If the transform has side effects, strive to make them idempotent (i.e. safe
to apply multiple times). Due to retries, the side effects may be executed
multiple times, possibly in parallel.
+- If the transform can have unprocessable (permanently failing) records and
you want the pipeline to proceed despite that:
+ - If bad records are safe to ignore, count the bad records in a metric. Make
sure the transform's documentation mentions this aggregator. Beware that there
is no programmatic access to reading the aggregator value from inside the
pipeline during execution.
+ - If bad records may need manual inspection by the user, emit them into an
output that contains only those records.
+ - Alternatively take a (default zero) threshold above which element failures
become bundle failures (structure the transform to count the total number of
elements and of failed elements, compare them and fail if failures are above
the threshold).
+- If the user requests a higher data consistency guarantee than you're able to
provide, fail. E.g.: if a user requests QoS 2 (exactly-once delivery) from an
MQTT connector, the connector should fail since Beam runners may retry writing
to the connector and hence exactly-once delivery can't be done.
Do not:
-* If you can't handle a failure, don't even catch it.
-*Exception: *It may be valuable to catch the error, log a message, and rethrow
it, if you're able to provide valuable context that the original error doesn't
have.
-* Never, ever, ever do this:
-`catch(...) { log an error; return null or false or otherwise ignore; }`
-**Rule of thumb: if a bundle didn't fail, its output must be correct and
complete.**
-For a user, a transform that logged an error but succeeded is silent data loss.
+- If you can't handle a failure, don't even catch it.
+ _Exception:_ It may be valuable to catch the error, log a message, and
rethrow it, if you're able to provide valuable context that the original error
doesn't have.
+- Never, ever, ever do this:
+ `catch(...) { log an error; return null or false or otherwise ignore; }`
+ **Rule of thumb: if a bundle didn't fail, its output must be correct and
complete.**
+ For a user, a transform that logged an error but succeeded is silent data
loss.
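As an editorial aside, the acceptable pattern in the _Exception_ above can be sketched in plain Java (class and method names here are hypothetical, not from the guide): catch, add context the original error lacks, and rethrow with the original as the cause, rather than logging and swallowing it.

```java
// Hypothetical wrapper exception: adds context while preserving the cause.
class WriteException extends RuntimeException {
    WriteException(String message, Throwable cause) {
        super(message, cause);
    }
}

class RecordWriter {
    // Simulated third-party call that fails.
    private void sendRpc(String table, String record) {
        throw new RuntimeException("connection refused");
    }

    void write(String table, String record) {
        try {
            sendRpc(table, record);
        } catch (RuntimeException e) {
            // Do NOT log-and-ignore: wrap with context and rethrow,
            // keeping the original error as the cause.
            throw new WriteException("Failed to write to table " + table, e);
        }
    }
}
```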
### Performance
Many runners optimize chains of `ParDo`s in ways that improve performance if
the `ParDo`s emit a small to moderate number of elements per input element, or
have relatively cheap per-element processing (e.g. Dataflow's "fusion"), but
limit parallelization if these assumptions are violated. In that case you may
need a "fusion break" (`Reshuffle.of()`) to improve the parallelizability of
processing the output `PCollection` of the `ParDo`.
-* If the transform includes a `ParDo` that outputs a potentially large number
of elements per input element, apply a fusion break after this `ParDo` to make
sure downstream transforms can process its output in parallel.
-* If the transform includes a `ParDo` that takes a very long time to process
an element, insert a fusion break before this `ParDo` to make sure all or most
elements can be processed in parallel regardless of how its input `PCollection`
was produced.
+- If the transform includes a `ParDo` that outputs a potentially large number
of elements per input element, apply a fusion break after this `ParDo` to make
sure downstream transforms can process its output in parallel.
+- If the transform includes a `ParDo` that takes a very long time to process
an element, insert a fusion break before this `ParDo` to make sure all or most
elements can be processed in parallel regardless of how its input `PCollection`
was produced.
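The fusion-break advice above can be sketched in Beam Java (a non-authoritative sketch of the file-expansion scenario; `ExpandPatternFn` and `ReadFileFn` are hypothetical `DoFn`s, and `Reshuffle.viaRandomKey()` is the variant of `Reshuffle` for non-key-value collections):

```java
PCollection<String> filePatterns = ...;
filePatterns
    // May emit a very large number of filenames per input pattern.
    .apply(ParDo.of(new ExpandPatternFn()))
    // Fusion break: lets downstream workers process filenames in parallel.
    .apply(Reshuffle.viaRandomKey())
    .apply(ParDo.of(new ReadFileFn()));
```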
### Documentation
Document how to configure the transform (give code examples), and what
guarantees it expects about its input or provides about its output, accounting
for the Beam model. E.g.:
-* Are the input and output collections of this transform bounded or unbounded,
or can it work with either?
-* If the transform writes data to a third-party system, does it guarantee that
data will be written at least once? at most once? exactly once? (how does it
achieve exactly-once in case the runner executes a bundle multiple times due to
retries or speculative execution a.k.a. backups?)
-* If the transform reads data from a third-party system, what's the maximum
potential degree of parallelism of the read? E.g., if the transform reads data
sequentially (e.g. executes a single SQL query), documentation should mention
that.
-* If the transform is querying an external system during processing (e.g.
joining a `PCollection` with information from an external key-value store),
what are the guarantees on freshness of queried data: e.g. is it all loaded at
the beginning of the transform, or queried per-element (in that case, what if
data for a single element changes while the transform runs)?
-* If there's a non-trivial relationship between arrival of items in the input
`PCollection` and emitting output into the output `PCollection`, what is this
relationship? (e.g. if the transform internally does windowing, triggering,
grouping, or uses the state or timers API)
+- Are the input and output collections of this transform bounded or unbounded,
or can it work with either?
+- If the transform writes data to a third-party system, does it guarantee that
data will be written at least once? at most once? exactly once? (how does it
achieve exactly-once in case the runner executes a bundle multiple times due to
retries or speculative execution a.k.a. backups?)
+- If the transform reads data from a third-party system, what's the maximum
potential degree of parallelism of the read? E.g., if the transform reads data
sequentially (e.g. executes a single SQL query), documentation should mention
that.
+- If the transform is querying an external system during processing (e.g.
joining a `PCollection` with information from an external key-value store),
what are the guarantees on freshness of queried data: e.g. is it all loaded at
the beginning of the transform, or queried per-element (in that case, what if
data for a single element changes while the transform runs)?
+- If there's a non-trivial relationship between arrival of items in the input
`PCollection` and emitting output into the output `PCollection`, what is this
relationship? (e.g. if the transform internally does windowing, triggering,
grouping, or uses the state or timers API)
### Logging
Anticipate abnormal situations that a user of the transform may run into. Log
information that they would have found sufficient for debugging, but limit the
volume of logging. Here is some advice that applies to all programs, but is
especially important when data volume is massive and execution is distributed.
Do:
-* When handling an error from a third-party system, log the full error with
any error details the third-party system provides about it, and include any
additional context the transform knows. This enables the user to take action
based on the information provided in the message. When handling an exception
and rethrowing your own exception, wrap the original exception in it (some
languages offer more advanced facilities, e.g. Java's "suppressed exceptions").
Never silently drop available information about an error.
-* When performing a rare (not per-element) and slow operation (e.g. expanding
a large file-pattern, or initiating an import/export job), log when the
operation begins and ends. If the operation has an identifier, log the
identifier, so the user can look up the operation for later debugging.
-* When computing something low-volume that is critically important for
correctness or performance of further processing, log the input and output, so
a user in the process of debugging can sanity-check them or reproduce an
abnormal result manually.
-E.g. when expanding a filepattern into files, log what the filepattern was and
how many parts it was split into; when executing a query, log the query and log
how many results it produced.
+- When handling an error from a third-party system, log the full error with
any error details the third-party system provides about it, and include any
additional context the transform knows. This enables the user to take action
based on the information provided in the message. When handling an exception
and rethrowing your own exception, wrap the original exception in it (some
languages offer more advanced facilities, e.g. Java's "suppressed exceptions").
Never silently drop available information about an error.
+- When performing a rare (not per-element) and slow operation (e.g. expanding
a large file-pattern, or initiating an import/export job), log when the
operation begins and ends. If the operation has an identifier, log the
identifier, so the user can look up the operation for later debugging.
+- When computing something low-volume that is critically important for
correctness or performance of further processing, log the input and output, so
a user in the process of debugging can sanity-check them or reproduce an
abnormal result manually.
+ E.g. when expanding a filepattern into files, log what the filepattern was
and how many parts it was split into; when executing a query, log the query and
log how many results it produced.
Do not:
-* Do not log at `INFO` per element or per bundle. `DEBUG`/`TRACE` may be okay
because these levels are disabled by default.
-* Avoid logging data payloads that may contain sensitive information, or
sanitize them before logging (e.g. user data, credentials, etc).
+- Do not log at `INFO` per element or per bundle. `DEBUG`/`TRACE` may be okay
because these levels are disabled by default.
+- Avoid logging data payloads that may contain sensitive information, or
sanitize them before logging (e.g. user data, credentials, etc).
### Testing
Data processing is tricky, full of corner cases, and difficult to debug,
because pipelines take a long time to run, it's hard to check if the output is
correct, you can't attach a debugger, and you often can't log as much as you
wish to, due to high volume of data. Because of that, testing is particularly
important.
#### Testing the transform's run-time behavior
-* Unit-test the overall semantics of the transform using `TestPipeline` and
`PAssert`. Start with testing against the direct runner. Assertions on
`PCollection` contents should be strict: e.g. when a read from a database is
expected to read the numbers 1 through 10, assert not just that there are 10
elements in the output `PCollection`, or that each element is in the range [1,
10] - but assert that each number 1 through 10 appears exactly once.
-* Identify non-trivial sequential logic in the transform that is prone to
corner cases which are difficult to reliably simulate using a `TestPipeline`,
extract this logic into unit-testable functions, and unit-test them. Common
corner cases are:
- * `DoFn`s processing empty bundles
- * `DoFn`s processing extremely large bundles (contents doesn't fit in
memory, including "hot keys" with a very large number of values)
- * Third-party APIs failing
- * Third-party APIs providing wildly inaccurate information
- * Leaks of `Closeable`/`AutoCloseable` resources in failure cases
- * Common corner cases when developing sources: complicated arithmetic in
`BoundedSource.split` (e.g. splitting key or offset ranges), iteration over
empty data sources or composite data sources that have some empty components.
-* Mock out the interactions with third-party systems, or better, use
["fake"](https://martinfowler.com/articles/mocksArentStubs.html)
implementations when available. Make sure that the mocked-out interactions are
representative of all interesting cases of the actual behavior of these systems.
-* To unit test `DoFn`s, `CombineFn`s, and `BoundedSource`s, consider using
`DoFnTester`, `CombineFnTester`, and `SourceTestUtils` respectively which can
exercise the code in non-trivial ways to flesh out potential bugs.
-* For transforms that work over unbounded collections, test their behavior in
the presence of late or out-of-order data using `TestStream`.
-* Tests must pass 100% of the time, including in hostile, CPU- or
network-constrained environments (continuous integration servers). Never put
timing-dependent code (e.g. sleeps) into tests. Experience shows that no
reasonable amount of sleeping is enough - code can be suspended for more than
several seconds.
-* For detailed instructions on test code organization, see the [Beam Testing
Guide](https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide).
+- Unit-test the overall semantics of the transform using `TestPipeline` and
`PAssert`. Start with testing against the direct runner. Assertions on
`PCollection` contents should be strict: e.g. when a read from a database is
expected to read the numbers 1 through 10, assert not just that there are 10
elements in the output `PCollection`, or that each element is in the range [1,
10] - but assert that each number 1 through 10 appears exactly once.
+- Identify non-trivial sequential logic in the transform that is prone to
corner cases which are difficult to reliably simulate using a `TestPipeline`,
extract this logic into unit-testable functions, and unit-test them. Common
corner cases are:
+ - `DoFn`s processing empty bundles
+ - `DoFn`s processing extremely large bundles (contents doesn't fit in
memory, including "hot keys" with a very large number of values)
+ - Third-party APIs failing
+ - Third-party APIs providing wildly inaccurate information
+ - Leaks of `Closeable`/`AutoCloseable` resources in failure cases
+ - Common corner cases when developing sources: complicated arithmetic in
`BoundedSource.split` (e.g. splitting key or offset ranges), iteration over
empty data sources or composite data sources that have some empty components.
+- Mock out the interactions with third-party systems, or better, use
["fake"](https://martinfowler.com/articles/mocksArentStubs.html)
implementations when available. Make sure that the mocked-out interactions are
representative of all interesting cases of the actual behavior of these systems.
+- To unit test `DoFn`s, `CombineFn`s, and `BoundedSource`s, consider using
`DoFnTester`, `CombineFnTester`, and `SourceTestUtils` respectively which can
exercise the code in non-trivial ways to flesh out potential bugs.
+- For transforms that work over unbounded collections, test their behavior in
the presence of late or out-of-order data using `TestStream`.
+- Tests must pass 100% of the time, including in hostile, CPU- or
network-constrained environments (continuous integration servers). Never put
timing-dependent code (e.g. sleeps) into tests. Experience shows that no
reasonable amount of sleeping is enough - code can be suspended for more than
several seconds.
+- For detailed instructions on test code organization, see the [Beam Testing
Guide](https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide).
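To illustrate the strictness guidance above outside of a pipeline, here is a minimal JDK-only sketch (in a real test, `PAssert.that(output).containsInAnyOrder(...)` on a `TestPipeline` plays the role of `strictCheck`). It shows how a weak size/range check lets a duplicate-and-drop bug slip through while an exact multiset comparison catches it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class StrictAssertionSketch {
    // Weak check: passes even if the source read "5" twice and dropped "7".
    static boolean weakCheck(List<Integer> output) {
        return output.size() == 10
            && output.stream().allMatch(n -> n >= 1 && n <= 10);
    }

    // Strict check: each of 1..10 must appear exactly once, in any order.
    static boolean strictCheck(List<Integer> output) {
        List<Integer> sorted = new ArrayList<>(output);
        Collections.sort(sorted);
        return sorted.equals(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10));
    }

    public static void main(String[] args) {
        List<Integer> buggy = Arrays.asList(1, 2, 3, 4, 5, 5, 6, 8, 9, 10);
        System.out.println(weakCheck(buggy));   // true  - the bug slips through
        System.out.println(strictCheck(buggy)); // false - the bug is caught
    }
}
```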
#### Testing transform construction and validation
The code for constructing and validating a transform is usually trivial and
mostly boilerplate. However, minor mistakes or typos in it can have serious
consequences (e.g. ignoring a property that the user has set), so it needs to
be tested as well. Yet, an excessive amount of trivial tests can be hard to
maintain and give a false impression that the transform is well-tested.
Do:
-* Test non-trivial validation code, where missing/incorrect/uninformative
validation may lead to serious problems: data loss, counter-intuitive behavior,
value of a property being silently ignored, or other hard-to-debug errors.
Create 1 test per non-trivial class of validation error. Some examples of
validation that should be tested:
- * If properties `withFoo()` and `withBar()` cannot both be specified at
the same time, test that a transform specifying both of them is rejected,
rather than one of the properties being silently ignored at runtime.
- * If the transform is known to behave incorrectly or counter-intuitively
for a particular configuration, test that this configuration is rejected,
rather than producing wrong results at runtime. For example, a transform might
work properly only for bounded collections, or only for globally-windowed
collections. Or, suppose a streaming system supports several levels of "quality
of service", one of which is "exactly once delivery". However, a transform that
writes to this system might be unable to provide exactly-once due to retries in
case of failures. In that case, test that the transform disallows specifying
exactly-once QoS, rather than failing to provide the expected end-to-end
semantics at runtime.
-* Test that each `withFoo()` method (including each overload) has effect (is
not ignored), using `TestPipeline` and `PAssert` to create tests where the
expected test results depend on the value of `withFoo()`.
+
+- Test non-trivial validation code, where missing/incorrect/uninformative
validation may lead to serious problems: data loss, counter-intuitive behavior,
value of a property being silently ignored, or other hard-to-debug errors.
Create 1 test per non-trivial class of validation error. Some examples of
validation that should be tested:
+ - If properties `withFoo()` and `withBar()` cannot both be specified at the
same time, test that a transform specifying both of them is rejected, rather
than one of the properties being silently ignored at runtime.
+ - If the transform is known to behave incorrectly or counter-intuitively for
a particular configuration, test that this configuration is rejected, rather
than producing wrong results at runtime. For example, a transform might work
properly only for bounded collections, or only for globally-windowed
collections. Or, suppose a streaming system supports several levels of "quality
of service", one of which is "exactly once delivery". However, a transform that
writes to this system might be unable to provide exactly-once due to retries in
case of failures. In that case, test that the transform disallows specifying
exactly-once QoS, rather than failing to provide the expected end-to-end
semantics at runtime.
+- Test that each `withFoo()` method (including each overload) has effect (is
not ignored), using `TestPipeline` and `PAssert` to create tests where the
expected test results depend on the value of `withFoo()`.
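As a concrete sketch of the mutual-exclusion point above (the configuration class and property names are hypothetical, borrowed from the `withFoo()`/`withBar()` example), validation should reject the conflicting configuration at construction time rather than silently ignoring one property:

```java
public class MutuallyExclusiveSketch {
    // Hypothetical transform configuration: withFoo() and withBar()
    // must not both be set.
    static final class MyTransformConfig {
        private String foo;
        private String bar;

        MyTransformConfig withFoo(String foo) { this.foo = foo; return this; }
        MyTransformConfig withBar(String bar) { this.bar = bar; return this; }

        // Called at pipeline construction time, e.g. from expand().
        void validate() {
            if (foo != null && bar != null) {
                throw new IllegalStateException(
                    "withFoo() and withBar() are mutually exclusive");
            }
        }
    }

    public static void main(String[] args) {
        // One test per non-trivial class of validation error: assert the
        // misconfiguration is rejected, not silently ignored at runtime.
        try {
            new MyTransformConfig().withFoo("a").withBar("b").validate();
            System.out.println("not rejected");
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```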
Do not:
-* Do not test successful validation (e.g. "validation does not fail when the
transform is configured correctly")
-* Do not test trivial validation errors (e.g. "validation fails when a
property is unset/null/empty/negative/...")
+
+- Do not test successful validation (e.g. "validation does not fail when the
transform is configured correctly")
+- Do not test trivial validation errors (e.g. "validation fails when a
property is unset/null/empty/negative/...")
### Compatibility
Do:
-* Generally, follow the rules of [semantic versioning](https://semver.org/).
-* If the API of the transform is not yet stable, annotate it as
`@Experimental` (Java) or `@experimental`
([Python](https://beam.apache.org/releases/pydoc/{{< param release_latest
>}}/apache_beam.utils.annotations.html)).
-* If the API is deprecated, annotate it as `@Deprecated` (Java) or `@deprecated`
([Python](https://beam.apache.org/releases/pydoc/{{< param release_latest
>}}/apache_beam.utils.annotations.html)).
-* Pay attention to the stability and versioning of third-party classes exposed
by the transform's API: if they are unstable or improperly versioned (do not
obey [semantic versioning](https://semver.org/)), it is better to wrap them in
your own classes.
+- Generally, follow the rules of [semantic versioning](https://semver.org/).
+- If the API of the transform is not yet stable, annotate it as
`@Experimental` (Java) or `@experimental`
([Python](https://beam.apache.org/releases/pydoc/{{< param release_latest
>}}/apache_beam.utils.annotations.html)).
+- If the API is deprecated, annotate it as `@Deprecated` (Java) or `@deprecated`
([Python](https://beam.apache.org/releases/pydoc/{{< param release_latest
>}}/apache_beam.utils.annotations.html)).
+- Pay attention to the stability and versioning of third-party classes exposed
by the transform's API: if they are unstable or improperly versioned (do not
obey [semantic versioning](https://semver.org/)), it is better to wrap them in
your own classes.
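As a sketch of the wrapping advice above (both class names here are hypothetical), shielding users from an unstable dependency means exposing your own stable type in the transform's public API instead of the third-party one:

```java
public class WrapperSketch {
    // Stand-in for an unstable third-party type that may change between
    // minor releases of its library.
    static final class ThirdPartyRecord {
        final String payload;
        ThirdPartyRecord(String payload) { this.payload = payload; }
    }

    // Our own stable type, exposed in the transform's public API. If the
    // third-party class changes, only this adapter needs updating - users'
    // code keeps compiling.
    public static final class MyRecord {
        private final String payload;

        MyRecord(ThirdPartyRecord record) { this.payload = record.payload; }

        public String getPayload() { return payload; }
    }

    public static void main(String[] args) {
        MyRecord record = new MyRecord(new ThirdPartyRecord("hello"));
        System.out.println(record.getPayload()); // prints "hello"
    }
}
```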
Do not:
-* Do not silently change the behavior of the transform, in a way where code
will keep compiling but will do something different than the previously
documented behavior (e.g. produce different output or expect different input,
of course unless the previous output was incorrect).
-Strive to make such incompatible behavior changes cause a compile error (e.g.
it's better to introduce a new transform for a new behavior and deprecate and
then delete the old one (in a new major version), than to silently change the
behavior of an existing transform), or at least a runtime error.
-* If the behavior of the transform stays the same and you're merely changing
implementation or API - do not change API of the transform in a way that will
make a user's code stop compiling.
Review comment:
Similarly, there are still a lot of changes here where only the bullet-point
style is changed. Please revert these.
I'd suggest just running `git checkout website-revamp --
website/www/site/content/en/contribute/ptransform-style-guide.md` for each
affected file.
Then, if it is absolutely necessary to switch from `java` to `language="java"`,
you might use a tool like `sed` to make the changes all at once.
##########
File path: website/www/site/content/en/blog/beam-2.22.0.md
##########
@@ -18,7 +18,7 @@ See the License for the specific language governing
permissions and
limitations under the License.
-->
We are happy to present the new 2.22.0 release of Beam. This release includes
both improvements and new functionality.
-See the [download page](/get-started/downloads/#2220-2020-06-08) for this
release.<!--more-->
+See the [download page](/get-started/downloads/#2220-2020-06-08) for this
release.
Review comment:
It looks like this change is backing out some recent changes to blog
posts, as well as deleting some recent blog posts (e.g. the 2.26.0 and 2.27.0
posts, and my own dataframe-api-preview-available). Can you bring these back?
##########
File path: website/www/site/content/en/contribute/practises.md
##########
@@ -0,0 +1,66 @@
+---
+title: "Beam community’s practises"
+layout: "arrow_template"
+---
+
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Beam community’s practises
+
+{{< figure src="/images/community/messages-icon.svg" >}}
+
+## Knows, upholds, and reinforces the Beam community’s practices
+
+- They have a proven commitment to the project
+- They share their intentions with the community
+- They accept and integrate community feedback in their plans, designs, code,
etc.
+- They earnestly try to make Beam better with their contributions
+
+In particular, if a code contributor:
+
+- They earnestly try to make Beam better with their own code
+- They earnestly try to make Beam better with code review
+- They accept and integrate feedback on their code
+- They know, follow, and enforce Beam’s practices while reviewing/merging code
- style, documentation, testing, backward compatibility, etc.
+
+{{< figure src="/images/community/beam-logo-icon.svg" >}}
+
+## Knows, upholds, and reinforces the Apache Software Foundation code of
conduct
+
+In particular, we manifestly strive to:
+
+- Be open
+- Be empathetic
+- Be welcoming
+- Be friendly
+- Be patient
+- Be collaborative
+- Be inquisitive
+- Be careful in the words that they choose
+
+[To learn more see the ASF documentation.](https://httpd.apache.org/docs/)
+
+{{< figure src="/images/community/diamond-icon.svg" >}}
+
+## Knows, upholds, and reinforces the responsibilities of an Apache Software
Foundation committer
+
+- They help create a product that will outlive the interest of any particular
volunteer (including themselves)
+- They grow and maintain the health of the Apache community
+- They help out with surrounding work, such as the website & documentation
+- They help users
+- They can be trusted to decide when code is ready for release, or when to ask
someone else to make the judgment
+- They can be trusted to decide when to merge code (if a code contributor) or
when to ask someone else to make the judgment
+
+[To learn more see the ASF documentation.](https://httpd.apache.org/docs/)
Review comment:
I can't find this page in the TDD to confirm that the copy is correct.
Is this another page where we're waiting on final copy from @griscz?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]