[
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697914#comment-16697914
]
Jozef Vilcek commented on BEAM-5865:
------------------------------------
I will do my best to figure it out and find enough time for it.
You will have to bear with me as these is very Beam internal and fuzzy to me. I
had a look to suggested example from DataFlow runner
[https://github.com/apache/beam/commit/cbb922c8a72680c5b8b4299197b515abf650bfdf#diff-a79d5c3c33f6ef1c4894b97ca907d541R347]
That seems straightforward enough, but here we need to do something more
evolved. Right now, I can picture these 2 tasks here:
# If user do not specify numShards, runner should not crash but pick some
default (e.g. numWorkers - flink parallelism)
# Make sure shards are distributed evenly among the workers and do not produce
hot spots
The first one can be reproduced from given PR example for DataFlow. The second
one I have no idea how to proceed. Any hints on where to look?
> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
> Key: BEAM-5865
> URL: https://issues.apache.org/jira/browse/BEAM-5865
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Maximilian Michels
> Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to
> BEAM-1438. That way, the user doesn't have to set shards manually which
> introduces additional shuffling and might cause skew in the distribution of
> data.
> As per discussion:
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)