[
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722919#comment-16722919
]
Jozef Vilcek commented on BEAM-5865:
------------------------------------
I do not strictly want to write partitions as they are; I want to write Kafka
data efficiently, and being able to do that without shuffling would be great.
However, I am under the impression that windowing requires a keyed PCollection,
which will result in a shuffle.
As a first step I am even OK with a shuffle, but I do not want to end up with
hot spots after it, where one worker manages e.g. 3 keys -> writes 3 files and
holds that state until flush, while other workers handle zero keys.
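A minimal sketch of the explicit-shard path being described, assuming KafkaIO
plus a windowed TextIO write with a fixed shard count (broker, topic, output
path and shard count are illustrative only, not from the original discussion):
{code:java}
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

public class FixedShardKafkaWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")              // illustrative
            .withTopic("events")                              // illustrative
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata())
        .apply(Values.<String>create())
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        // Fixed shard count: the file sink assigns each element one of N shard
        // keys and groups by that key, i.e. a shuffle. If N is smaller than the
        // parallelism, some workers buffer several shards' state until the
        // window is flushed while others hold none -- the hot spots above.
        .apply("WriteFiles", TextIO.write()
            .to("/data/out/events")                           // illustrative
            .withWindowedWrites()
            .withNumShards(3));

    p.run();
  }
}
{code}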
How can I write a custom transform and make sure that under errors / restarts I
will not observe duplicate events in this transform? I cannot see a way to make
this work with state. I would be grateful for any hints.
> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
> Key: BEAM-5865
> URL: https://issues.apache.org/jira/browse/BEAM-5865
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Maximilian Michels
> Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to
> BEAM-1438. That way, the user doesn't have to set shards manually, which
> introduces additional shuffling and might cause skew in the distribution of
> data.
> As per discussion:
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E