[ https://issues.apache.org/jira/browse/BEAM-1841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Kirpichov reassigned BEAM-1841:
--------------------------------------
Assignee: Eugene Kirpichov (was: Davor Bonaci)
> FileBasedSource should better handle when set of files grows while job is
> running
> ---------------------------------------------------------------------------------
>
> Key: BEAM-1841
> URL: https://issues.apache.org/jira/browse/BEAM-1841
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow, sdk-java-core
> Reporter: Eugene Kirpichov
> Assignee: Eugene Kirpichov
>
> In some cases people run pipelines over directories where the set of files
> grows while the job runs. This can lead to the following situation, in
> particular with the Dataflow runner:
> At job submission time, FileBasedSource estimates the current size of the
> filepattern and gets a small number. The Dataflow runner therefore chooses a
> small desiredBundleSizeBytes to pass to splitIntoBundles(). However, by the
> time splitIntoBundles() runs, the set of files has grown greatly, and we
> produce many more bundles than anticipated, each unnecessarily small.
> I see a few things we could do:
> - In splitIntoBundles(), compute the actual size and detect when the desired
> size is unreasonably small for it; e.g. set an upper threshold on how many
> bundles we produce in total.
> - Somehow remember the size that was estimated at submission time. Then, in
> splitIntoBundles(), compute the actual current size and scale
> desiredBundleSizeBytes accordingly to get approximately the intended number
> of bundles. Caveat: files may still change between the moment the size is
> estimated and the moment splitting happens.
> - (much larger in scope) Change the whole protocol to use a number of bundles
> instead of a bundle size in bytes. This probably won't happen with
> BoundedSource, but it is going to be the case with Splittable DoFn.
> The first option seems by far the simplest.
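
To make the first two options concrete, here is a minimal sketch of the
arithmetic involved. It is not Beam code: the class BundleSizing, both method
names, and the maxBundles parameter are hypothetical names chosen for
illustration, and the sketch assumes the total size in bytes is already known
at split time.

```java
/** Hypothetical helper for illustration only; not part of the Beam SDK. */
final class BundleSizing {
    private BundleSizing() {}

    /** Ceiling division for a non-negative numerator and positive divisor. */
    private static long ceilDiv(long a, long b) {
        return a / b + (a % b == 0 ? 0 : 1);
    }

    /**
     * First option: raise the desired bundle size when it would otherwise
     * yield more than maxBundles bundles (maxBundles is an assumed threshold).
     */
    static long capBundleCount(long desiredBundleSizeBytes, long actualTotalBytes, long maxBundles) {
        if (desiredBundleSizeBytes <= 0 || maxBundles <= 0 || actualTotalBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative and limits positive");
        }
        // Smallest bundle size that keeps the bundle count at or below maxBundles.
        long minSizeForCap = ceilDiv(actualTotalBytes, maxBundles);
        return Math.max(desiredBundleSizeBytes, minSizeForCap);
    }

    /**
     * Second option: rescale the desired bundle size so that the actual data
     * is split into roughly the number of bundles implied at submission time.
     */
    static long scaleToEstimate(long desiredBundleSizeBytes, long estimatedTotalBytes, long actualTotalBytes) {
        if (desiredBundleSizeBytes <= 0 || estimatedTotalBytes < 0 || actualTotalBytes < 0) {
            throw new IllegalArgumentException("sizes must be non-negative and desired size positive");
        }
        // Bundle count the runner implicitly asked for, based on the estimate.
        long intendedBundles = Math.max(1, ceilDiv(estimatedTotalBytes, desiredBundleSizeBytes));
        // Spread the actual bytes over that many bundles; never return 0.
        return Math.max(1, ceilDiv(actualTotalBytes, intendedBundles));
    }

    public static void main(String[] args) {
        // Estimated 1,000 bytes at submission; 100,000 bytes present at split time.
        System.out.println(capBundleCount(100L, 100_000L, 50L));
        System.out.println(scaleToEstimate(100L, 1_000L, 100_000L));
    }
}
```

With these made-up numbers, an unadjusted split would produce 1,000 bundles of
100 bytes each. The cap limits this to 50 bundles, while the rescaling keeps
the 10 bundles implied by the original estimate. The caveat from the
description still applies: the actual size can change again after it is
measured.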
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)