[
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729378#comment-16729378
]
Jozef Vilcek commented on BEAM-5865:
------------------------------------
Feedback from my experiments:
# I have modified Beam so that a sharding function can be specified for FileIO.
It is called to create a ShardedKey<Integer> per input element. I supplied a
function that creates a special key chosen to land on the desired Flink worker,
similar to what is suggested here:
[http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Use-keyBy-to-deterministically-hash-each-record-to-a-processor-task-slot-td16483.html]
This worked great, and FileIO was perfectly balanced across Flink workers.
# I am not sure whether this can be done without a shuffle. From reading more
about Flink, I am under the impression that Flink operators cannot do it, and
that the Flink sink probably has a special internal feature that lets it pull
worker-local data and create, e.g., one file if the sink wishes to. Guarantees
are achieved by the ability to observe checkpoint completions and manage state
based on those events (e.g. commit temp files).
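A minimal sketch of the key search described in point 1, not the actual patch.
It brute-forces one integer key per parallel subtask. The class and method
names are hypothetical, and scramble() is only a stand-in hash: as far as I
understand, Flink's real mapping is
KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism,
parallelism), which should replace subtaskFor() in a real runner.

```java
import java.util.Arrays;

final class BalancedShardKeys {

    // Stand-in 32-bit mixing function (assumption, NOT Flink's exact hash).
    static int scramble(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Maps a key to a key group, then the key group to a subtask index.
    static int subtaskFor(int key, int maxParallelism, int parallelism) {
        int keyGroup = Math.floorMod(scramble(key), maxParallelism);
        return keyGroup * parallelism / maxParallelism;
    }

    // Returns keys such that keys[i] is routed to subtask i.
    static int[] findKeys(int maxParallelism, int parallelism) {
        if (parallelism < 1 || parallelism > maxParallelism) {
            throw new IllegalArgumentException("need 1 <= parallelism <= maxParallelism");
        }
        int[] keys = new int[parallelism];
        boolean[] found = new boolean[parallelism];
        int remaining = parallelism;
        for (int candidate = 0; remaining > 0; candidate++) {
            if (candidate == Integer.MAX_VALUE) {
                throw new IllegalStateException("no key found for some subtask");
            }
            int subtask = subtaskFor(candidate, maxParallelism, parallelism);
            if (!found[subtask]) {
                found[subtask] = true;
                keys[subtask] = candidate;
                remaining--;
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        // Example values only: maxParallelism 128, parallelism 4.
        System.out.println(Arrays.toString(findKeys(128, 4)));
    }
}
```

A sharding function could then assign element n to keys[n % parallelism], so
that each subtask receives exactly one shard key.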
I am going to submit a JIRA / question to Flink regarding the ability to achieve
a good key distribution when num_keys ~= num_parallelism.
Do you think it would be acceptable in Beam to use such hackish, carefully
generated keys for the Flink runner?
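To illustrate why num_keys ~= num_parallelism is a problem with plain hashing,
here is a rough sketch that counts how many keys land on each subtask. The
names are hypothetical and scramble() is a stand-in hash, not Flink's own. If
the hash behaves like a uniform random assignment, n keys over n subtasks leave
about a fraction (1 - 1/n)^n of the subtasks idle, which approaches 1/e (roughly
37%) as n grows.

```java
import java.util.Arrays;

final class KeySkewDemo {

    // Stand-in 32-bit mixing function (assumption, NOT Flink's exact hash).
    static int scramble(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Counts how many of the keys 0..numKeys-1 are routed to each subtask.
    static int[] loadPerSubtask(int numKeys, int maxParallelism, int parallelism) {
        int[] load = new int[parallelism];
        for (int key = 0; key < numKeys; key++) {
            int keyGroup = Math.floorMod(scramble(key), maxParallelism);
            load[keyGroup * parallelism / maxParallelism]++;
        }
        return load;
    }

    // Number of subtasks that received no key at all.
    static int idleSubtasks(int[] load) {
        int idle = 0;
        for (int count : load) {
            if (count == 0) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // Example values only: 8 shard keys, maxParallelism 128, parallelism 8.
        int[] load = loadPerSubtask(8, 128, 8);
        System.out.println(Arrays.toString(load) + " idle=" + idleSubtasks(load));
    }
}
```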
> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
> Key: BEAM-5865
> URL: https://issues.apache.org/jira/browse/BEAM-5865
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Maximilian Michels
> Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to
> BEAM-1438. That way, the user doesn't have to set shards manually, which
> introduces additional shuffling and might cause skew in the distribution of
> data.
> As per discussion:
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)