[
https://issues.apache.org/jira/browse/BEAM-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720192#comment-16720192
]
Maximilian Michels commented on BEAM-2873:
------------------------------------------
Just found this related issue. I think we should go ahead and finish up the
Runner-determined sharding.
Afterwards, we could think of additional work to avoid the Reshuffle when
desired, as discussed in BEAM-5865.
> Detect number of shards for file sink in Flink Streaming Runner
> ---------------------------------------------------------------
>
> Key: BEAM-2873
> URL: https://issues.apache.org/jira/browse/BEAM-2873
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Aljoscha Krettek
> Assignee: Dawid Wysakowicz
> Priority: Major
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> [~reuvenlax] mentioned that this is done for the Dataflow Runner and the
> default behaviour on Flink can be somewhat surprising for users.
> ML entry: https://www.mail-archive.com/[email protected]/msg02665.html:
> This is how the file sink has always worked in Beam. If no sharding is
> specified, then this means runner-determined sharding, and by default that is
> one file per bundle. If Flink has small bundles, then I suggest using the
> withNumShards method to explicitly pick the number of output shards.
> The Flink runner can detect that runner-determined sharding has been chosen,
> and override it with a specific number of shards. For example, the Dataflow
> streaming runner (which as you mentioned also has small bundles) detects this
> case and sets the number of out files shards based on the number of workers
> in the worker pool
> [Here|https://github.com/apache/beam/blob/9e6530adb00669b7cf0f01cb8b128be0a21fd721/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L354]
> is the code that does this; it should be quite simple to do something
> similar for Flink, and then there will be no need for users to explicitly
> call withNumShards themselves.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)