[ 
https://issues.apache.org/jira/browse/BEAM-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708749#comment-16708749
 ] 

Jozef Vilcek commented on BEAM-5865:
------------------------------------

I had a bit of clarification chat with [~mxm] over slack. Here I post back what 
I make out of it:

FileIO when writing to files is doing sort of :     
{noformat}
PCollection[Value] -> PCollection[KV[RandomShardId, Value]] -> GBK -> 
PCollection[KV[RandomShardId, Iterable[Value]]] -> Write Iterable[Value] to 
file{noformat}
GBK in Flink is translated to KeyedStream, which (not 100% certain) always use 
hash partitioning and therefore can not be easily changed to balance out shard 
processing. Anyway, we want to be even better and do some sort of local 
sharding where we replace 
{noformat}
PCollection[KV[RandomShardId, Value]]{noformat}
for 
{noformat}
PCollection[KV[FlinkSubtaskID, Value]]{noformat}
however, to connect back to `FileIO.Sink` for writes, we need to make back some 
sort of group of values to be written in to the file. To be able to recognize 
which GBK to optimise (which one is FileIO related, a new URN can be created.

 

Open question is, how to, from available Beam and Flink API point of view, 
construct step:
{noformat}
PCollection[KV[FlinkSubtaskID, Value]] -> PCollection[KV[FlinkSubtaskID, 
Iterable[Value]]]{noformat}
which would do chop worker local `Values` stream to groups writable by 
`FileIO.Sink` interface.

 

> Auto sharding of streaming sinks in FlinkRunner
> -----------------------------------------------
>
>                 Key: BEAM-5865
>                 URL: https://issues.apache.org/jira/browse/BEAM-5865
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-flink
>            Reporter: Maximilian Michels
>            Priority: Major
>
> The Flink Runner should do auto-sharding of streaming sinks, similar to 
> BEAM-1438. That way, the user doesn't have to set shards manually which 
> introduces additional shuffling and might cause skew in the distribution of 
> data.
> As per discussion: 
> https://lists.apache.org/thread.html/7b92145dd9ae68da1866f1047445479f51d31f103d6407316bb4114c@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to