[ 
https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047415#comment-17047415
 ] 

Yuan Mei commented on FLINK-15670:
----------------------------------

Thanks Stephan!

This is definitely a cleaner interface. The reason I separate the interface 
into two function calls is based on two considerations:

1) data reuse and 2) sub-graph failure recovery

For the first item `data resue`, I was thinking users write the shuffled data 
to Kafka mostly because they want to reuse the data somehow. Otherwise, they 
can just use normal shuffle without involving Kafka. To reuse the data, we 
would have to provide them some way to read out the data? That is why the read 
API is provided.

I can provide an one call wrap of write and read as well. 

 

The second consideration is sub-graph failure recovery. If a reader and a 
writer are running in different `StreamExecutionEnvironment`, they are going to 
do failure-recovery separately by nature. Otherwise, the internal scheduler 
would have to identify disconnected subgraph (FailoverRegion as mentioned in 
FLIP1). I am not sure whether this behavior is enabled by default or not. 

 

> Provide a Kafka Source/Sink pair that aligns Kafka's Partitions and Flink's 
> KeyGroups
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-15670
>                 URL: https://issues.apache.org/jira/browse/FLINK-15670
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream, Connectors / Kafka
>            Reporter: Stephan Ewen
>            Priority: Major
>              Labels: usability
>             Fix For: 1.11.0
>
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them 
> without partitioning them again (avoid shuffles)
> 2. You can use this to shuffle through Kafka, thereby decomposing the job 
> into smaller jobs and independent pipelined regions that fail over 
> independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to