[
https://issues.apache.org/jira/browse/BEAM-217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kenneth Knowles updated BEAM-217:
---------------------------------
Fix Version/s: First stable release
> BoundedSource.splitAtFraction should be splitAfterFraction
> ----------------------------------------------------------
>
> Key: BEAM-217
> URL: https://issues.apache.org/jira/browse/BEAM-217
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Reporter: Eugene Kirpichov
> Assignee: Eugene Kirpichov
> Priority: Minor
> Labels: backward-incompatible
> Fix For: First stable release
>
>
> Dynamic work rebalancing works by 1) determining how long the bundle should
> take in order to not be a straggler - the "deadline", 2) predicting where the
> bundle will be (position or fraction) by that deadline, and 3) requesting an
> atomic split (splitAtFraction).
> Currently all BoundedSource's and (in Dataflow runner) NativeReaderIterator's
> refuse splits if they have already consumed the requested split position.
> Splitting a task [A, C) at position B generates [A, B) and [B, C), so if we
> predict that by deadline the task will have last consumed position X, we
> should split not "at" X, but "after" X (i.e. at next(X)) - i.e. into [A, X]
> (because X is already consumed) and (X, C) equivalently [A, next(X)) and
> [next(X), C).
> One way to fit this into the current BoundedSource API is to rename
> splitAtFraction to splitAfterFraction and adjust the documentation.
> Documentation of getFractionConsumed also needs to be clarified to emphasize
> that it should return what fraction of all positions in the source have
> already been consumed, including the position of the last consumed record.
> For example, for an index-range task with range [0, 5), after it has read the
> first record at position 0, it has consumed 20%, rather than 0% (and of
> course not 40% even if an internal "next index" variable is now 1 - this
> mistake is especially easy to make in a file-based source if you base the
> calculations on the file's offset *after* consuming the record - the correct
> way is to calculate based on offsets of beginning of records).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)