[ https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730642#comment-16730642 ]
Jozef Vilcek commented on BEAM-5865:
------------------------------------
I have experimented with this (writing a keyed stream without the need for a
shuffle) and modified FlinkRunner to:
1. Replace WriteFiles.ApplyShardingKeyFn with an alternative which generates
keys fit for the current subtask index
2. Replace keyBy in GroupByKeyTranslator with reinterpretAsKeyedStream in
case of sharding
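To illustrate step 1, here is a minimal sketch of how a key that lands on a
given subtask could be found by brute-force search. This is not the code from
my experiment: the class and method names are hypothetical, and the hash is a
stand-in so that the example runs without Flink on the classpath. It assumes
Flink's usual routing, where the key's hash selects a key group and the key
group maps to an operator index as keyGroup * parallelism / maxParallelism. A
real runner change would presumably call Flink's
KeyGroupRangeAssignment.assignKeyToParallelOperator instead of the stand-in.

```java
import java.util.Arrays;

/**
 * Sketch only: for each subtask, search for an integer key that key-group
 * style routing would send back to that same subtask.
 */
final class SubtaskLocalKeys {

    // Stand-in 32-bit mixer (the murmur3 finalizer). ASSUMPTION: Flink applies
    // its own hash to key.hashCode(); this one only keeps the sketch runnable.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Key group of a key: hash of the key's hashCode, modulo maxParallelism.
    static int keyGroup(int key, int maxParallelism) {
        return Math.floorMod(mix(Integer.hashCode(key)), maxParallelism);
    }

    // Operator (subtask) index that owns the key group of the given key.
    static int operatorIndex(int key, int parallelism, int maxParallelism) {
        return keyGroup(key, maxParallelism) * parallelism / maxParallelism;
    }

    // Linear search for the smallest non-negative key routed to `subtask`.
    static int keyForSubtask(int subtask, int parallelism, int maxParallelism) {
        if (parallelism < 1 || parallelism > maxParallelism
                || subtask < 0 || subtask >= parallelism) {
            throw new IllegalArgumentException("invalid subtask or parallelism");
        }
        for (int candidate = 0; candidate < Integer.MAX_VALUE; candidate++) {
            if (operatorIndex(candidate, parallelism, maxParallelism) == subtask) {
                return candidate;
            }
        }
        throw new IllegalStateException("no key found for subtask " + subtask);
    }

    // One subtask-local key per subtask, indexed by subtask.
    static int[] keysForAllSubtasks(int parallelism, int maxParallelism) {
        int[] keys = new int[parallelism];
        for (int subtask = 0; subtask < parallelism; subtask++) {
            keys[subtask] = keyForSubtask(subtask, parallelism, maxParallelism);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Example values only: parallelism 4, maxParallelism 128.
        System.out.println(Arrays.toString(keysForAllSubtasks(4, 128)));
    }
}
```

A sharding function built this way would tag each element with the key of the
subtask it is already on, which is the precondition for step 2: the stream is
then already partitioned the way keyBy would partition it.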
It works as expected, and data seems to be written to HDFS from the task which
reads from Kafka (much less shuffle over the network).
However, actual throughput did drop. I am not quite sure why and do not have
many ideas on how to identify the reason.
> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
> Key: BEAM-5865
> URL: https://issues.apache.org/jira/browse/BEAM-5865
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Maximilian Michels
> Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to
> BEAM-1438. That way, the user doesn't have to set shards manually which
> introduces additional shuffling and might cause skew in the distribution of
> data.
> As per discussion:
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E