[ 
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701682#comment-16701682
 ] 

Maximilian Michels commented on BEAM-5865:
------------------------------------------

I think it would be fine to assume evenly distributed data in the case where we 
skip the reshuffle and just use Flink's subtask index as the key for writes. 
[~robertwb] mentioned that we may have to provide stable inputs for sinks and 
that's why an actual Reshuffle using a key might be necessary. Though for 
streaming that doesn't have to be necessary either.

In the above, just setting the number of shards wouldn't be enough, we still 
have to avoid the Reshuffle.

> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
>                 Key: BEAM-5865
>                 URL: https://issues.apache.org/jira/browse/BEAM-5865
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-flink
>            Reporter: Maximilian Michels
>            Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to 
> BEAM-1438. That way, the user doesn't have to set shards manually which 
> introduces additional shuffling and might cause skew in the distribution of 
> data.
> As per discussion: 
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to