[ 
https://issues.apache.org/jira/browse/FLINK-15670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108783#comment-17108783
 ] 

Yuan Mei edited comment on FLINK-15670 at 5/16/20, 3:44 AM:
------------------------------------------------------------

Things to follow up and discuss (listed here in case I forget about them):

 
 # Address cases when # of partitions != # of consumer tasks
 # Emit Kafka fetcher records in a batched (bundled) mode, similar to FLINK-17307 (Add collector to deserialize in KafkaDeserializationSchema)
 # Whether to separate the sink (producer) and source (consumer) into different jobs.
 ** Although the sink and source are recovered independently under regional failover, they share the same checkpoint coordinator and, correspondingly, the same global checkpoint snapshot.
 ** That means that if the consumer fails, the producer cannot commit written data, because of the two-phase commit setup (the producer needs a checkpoint-complete signal to complete the second phase).
 # I find the current "end of stream" logic in KafkaFetcher a bit strange: if any partition has a record signaled as "END_OF_STREAM", the fetcher stops running. Note that the signal comes from the deserializer, i.e., from the Kafka data itself. But other topics and partitions may still have data to read; finishing Partition0 cannot guarantee that Partition1 is also finished. Maybe I misunderstand what an "END_OF_STREAM" signal means.

 


> Provide a Kafka Source/Sink pair that aligns Kafka's Partitions and Flink's 
> KeyGroups
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-15670
>                 URL: https://issues.apache.org/jira/browse/FLINK-15670
>             Project: Flink
>          Issue Type: New Feature
>          Components: API / DataStream, Connectors / Kafka
>            Reporter: Stephan Ewen
>            Assignee: Yuan Mei
>            Priority: Major
>              Labels: pull-request-available, usability
>             Fix For: 1.11.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This Source/Sink pair would serve two purposes:
> 1. You can read topics that are already partitioned by key and process them 
> without partitioning them again (avoid shuffles)
> 2. You can use this to shuffle through Kafka, thereby decomposing the job 
> into smaller jobs and independent pipelined regions that fail over 
> independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
