[
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737241#comment-16737241
]
Maximilian Michels commented on BEAM-5865:
------------------------------------------
You're right that we need a shuffle/keyBy to do windowing in the first place.
However, that is not our concern because we merely want to avoid a reshuffle
before we write to the file system, to keep locality of the data.
I think your approach is interesting. I'm not sure if I'd recommend doing the
key generation hack you linked to achieve equal partitioning, but I could see
that we provide a means to provide a custom sharding function for the file io.
As for reinterpretAsKeyedStream I think we would do best to provide a means to
skip reshuffling, as the stream is already keyed and we could apply the
sharding function and directly write it to HDFS. This also requires a change in
the file io to allow skipping the existing reshuffling logic which was meant to
provide stable input but is not required in the case of Flink.
{quote}
I am not familiar on how Flink allocates CPU times and buffering between
operators and slot groups so can not fully reason about it.
{quote}
IMHO it does not provide dedicate CPU to tasks. Each task runs in a separate
thread and as of now they basically compete against resources. Chained
operators run within a single task.
> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
> Key: BEAM-5865
> URL: https://issues.apache.org/jira/browse/BEAM-5865
> Project: Beam
> Issue Type: Improvement
> Components: runner-flink
> Reporter: Maximilian Michels
> Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to
> BEAM-1438. That way, the user doesn't have to set shards manually which
> introduces additional shuffling and might cause skew in the distribution of
> data.
> As per discussion:
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)