jkff commented on a change in pull request #11406: [BEAM-9748] Move
Reparallelize transform to Reshuffle
URL: https://github.com/apache/beam/pull/11406#discussion_r408484775
##########
File path:
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java
##########
@@ -65,6 +66,16 @@ private Reshuffle() {}
return new ViaRandomKey<>();
}
+ @Experimental
Review comment:
Annotation should probably be below the comment.
Also suggest rephrasing a bit, to explain when one should use one or the
other, and what are the consequences of choosing wrong. I'm not sure that
"sequentially generated" is clear enough to a casual user. Maybe something like
this:
```
Materializes the input and prepares it to be consumed in a highly parallel
fashion.
This version is tailored to the case when input was produced in an extremely
sequential way - typically by a ParDo that emits millions of outputs _per input
element_, e.g., executing a large database query or a large simulation and
emitting all of their results.
Internally, this version first materializes the input at a moderate cost
before reshuffling it internally using viaRandomKey(), making the reshuffling
itself significantly cheaper in these extreme cases on some runners. Use this
over viaRandomKey() only if your benchmarks show an improvement.
```
And mention this at the class-level documentation of Reshuffle for
visibility.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services