[ 
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737241#comment-16737241
 ] 

Maximilian Michels commented on BEAM-5865:
------------------------------------------

You're right that we need a shuffle/keyBy to do windowing in the first place. 
However, that is not our concern because we merely want to avoid a reshuffle 
before we write to the file system, to keep locality of the data. 

I think your approach is interesting. I'm not sure if I'd recommend doing the 
key generation hack you linked to achieve equal partitioning, but I could see 
that we provide a means to provide a custom sharding function for the file io.

As for reinterpretAsKeyedStream I think we would do best to provide a means to 
skip reshuffling, as the stream is already keyed and we could apply the 
sharding function and directly write it to HDFS. This also requires a change in 
the file io to allow skipping the existing reshuffling logic which was meant to 
provide stable input but is not required in the case of Flink. 

{quote}
 I am not familiar on how Flink allocates CPU times and buffering between 
operators and slot groups so can not fully reason about it.
{quote}

IMHO it does not provide dedicate CPU to tasks. Each task runs in a separate 
thread and as of now they basically compete against resources. Chained 
operators run within a single task.

> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
>                 Key: BEAM-5865
>                 URL: https://issues.apache.org/jira/browse/BEAM-5865
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-flink
>            Reporter: Maximilian Michels
>            Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to 
> BEAM-1438. That way, the user doesn't have to set shards manually which 
> introduces additional shuffling and might cause skew in the distribution of 
> data.
> As per discussion: 
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to