[ 
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730642#comment-16730642
 ] 

Jozef Vilcek commented on BEAM-5865:
------------------------------------

I have experimented with this (write keyed stream without need of a shuffle) an 
modified FlinkRunner to:
1. Replace WriteFiles.ApplyShardingKeyFn with alternative which generates keys 
fit for current subtask index
2. Replace keyBy in GroupByKeyTranslator to use reinterpretAsKeyedStream in 
case of sharding

It works as expected and data seems to be written HDFS from task which reads 
from kafka (much less shuffle over network). 
However, actual throughput did drop. I am quite not sure why and do not have 
much ideas on how to identify the reason.

> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
>                 Key: BEAM-5865
>                 URL: https://issues.apache.org/jira/browse/BEAM-5865
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-flink
>            Reporter: Maximilian Michels
>            Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to 
> BEAM-1438. That way, the user doesn't have to set shards manually which 
> introduces additional shuffling and might cause skew in the distribution of 
> data.
> As per discussion: 
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to