[
https://issues.apache.org/jira/browse/BEAM-9748?focusedWorklogId=422415&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-422415
]
ASF GitHub Bot logged work on BEAM-9748:
----------------------------------------
Author: ASF GitHub Bot
Created on: 14/Apr/20 23:00
Start Date: 14/Apr/20 23:00
Worklog Time Spent: 10m
Work Description: jkff commented on pull request #11406: [BEAM-9748] Move
Reparallelize transform to Reshuffle
URL: https://github.com/apache/beam/pull/11406#discussion_r408484775
##########
File path:
sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Reshuffle.java
##########
@@ -65,6 +66,16 @@ private Reshuffle() {}
return new ViaRandomKey<>();
}
+ @Experimental
Review comment:
Annotation should probably be below the comment.
Also suggest rephrasing a bit, to explain when one should use one or the
other, and what are the consequences of choosing wrong. I'm not sure that
"sequentially generated" is clear enough to a casual user. Maybe something like
this:
```
Materializes the input and prepares it to be consumed in a highly parallel
fashion.
This version is tailored to the case when input was produced in an extremely
sequential way - typically by a ParDo that emits millions of outputs _per input
element_, e.g., executing a large database query or a large simulation and
emitting all of their results.
Internally, this version first materializes the input at a moderate cost
before reshuffling it internally using viaRandomKey(), making the reshuffling
itself significantly cheaper in these extreme cases on some runners. Use this
over viaRandomKey() only if your benchmarks show an improvement.
```
And mention this at the class-level documentation of Reshuffle for
visibility.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 422415)
Time Spent: 1h 50m (was: 1h 40m)
> Add Reshuffle.ForSequentiallyGeneratedInput transform
> -----------------------------------------------------
>
> Key: BEAM-9748
> URL: https://issues.apache.org/jira/browse/BEAM-9748
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Ismaël Mejía
> Assignee: Ismaël Mejía
> Priority: Minor
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Some DoFn based IOs like JdbcIO and RedisIO rely on a different approach to
> Reparallelize outputs using a combination of a an empty PCollectionView to
> force materialization and Reshuffle.viaRandomkey to reparallelize a
> PCollection. This issue extracts this transform and expose it as part of the
> Reshuffle to avoid repeating the code for transforms (notably IOs) that
> produce lots of sequentially generated data where and benefit of this
> alternative approach to perform better reparallelization of its output.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)