jkff commented on a change in pull request #11406: [BEAM-9748] Move 
Reparallelize transform to Reshuffle
URL: https://github.com/apache/beam/pull/11406#discussion_r408484775
 
 

 ##########
 File path: 
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java
 ##########
 @@ -65,6 +66,16 @@ private Reshuffle() {}
     return new ViaRandomKey<>();
   }
 
+  @Experimental
 
 Review comment:
   Annotation should probably be below the comment.
   
   Also suggest rephrasing a bit, to explain when one should use one or the 
other, and what are the consequences of choosing wrong. I'm not sure that 
"sequentially generated" is clear enough to a casual user. Maybe something like 
this:
   
   ```
   Materializes the input and prepares it to be consumed in a highly parallel 
fashion.
   
   This version is tailored to the case when input was produced in an extremely 
sequential way - typically by a ParDo that emits millions of outputs _per input 
element_, e.g., executing a large database query or a large simulation and 
emitting all of their results.
   
   Internally, this version first materializes the input at a moderate cost 
before reshuffling it internally using viaRandomKey(), making the reshuffling 
itself significantly cheaper in these extreme cases on some runners. Use this 
over viaRandomKey() only if your benchmarks show an improvement.
   ```
   
   And mention this at the class-level documentation of Reshuffle for 
visibility.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to