[ 
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703642#comment-16703642
 ] 

Maximilian Michels commented on BEAM-5865:
------------------------------------------

I think the stable input is relevant for continuing to write from a checkpoint, 
especially when exactly-once semantics should be obeyed which requires 
idempotent writes. For files this should not be possible anyways.

The key is assigned randomly as of now. It's in the 
{{WritesFiles$WriteShardedBundlesToTempFiles}}, but we could decide to use a 
different partitioning strategy for achieving balanced partitions. But we can 
only do that if we know that we're not executing a normal Reshuffle. So the 
code there would have to be altered to use a custom URN different from 
Reshuffle.

We would then have to embed translation code in the FlinkRunner translation 
classes where we can influence the partitioning. A simple transform replacement 
would not be enough, as you already pointed out.



> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
>                 Key: BEAM-5865
>                 URL: https://issues.apache.org/jira/browse/BEAM-5865
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-flink
>            Reporter: Maximilian Michels
>            Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to 
> BEAM-1438. That way, the user doesn't have to set shards manually which 
> introduces additional shuffling and might cause skew in the distribution of 
> data.
> As per discussion: 
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to