Naireen commented on PR #32344:
URL: https://github.com/apache/beam/pull/32344#issuecomment-2316002328
> One other benefit of the committed offsets is that it aids customer
> visibility into the progress of partitions, since that can be queried or
> displayed externally to Dataflow.
That is a good point, and an argument for still allowing it. What I don't
understand, though, is how reshuffling/redistributing interacts with committing
offsets. Here is the current graph with offset commits enabled:
PCollection<KafkaSourceDescriptor>
  --> ParDo(ReadFromKafkaDoFn<KafkaSourceDescriptor,
                              KV<KafkaSourceDescriptor, KafkaRecord>>)
  --> Reshuffle() --> Map(output KafkaRecord)
        |
        --> KafkaCommitOffset
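
For reference, a minimal user-level sketch that triggers this expansion; the
broker address, topic, and group id are placeholders:

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitOffsetsExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // commitOffsetsInFinalize() is what adds the KafkaCommitOffset branch
    // shown in the graph above; it requires a consumer group id.
    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")  // placeholder address
        .withTopic("my-topic")                // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(
            Collections.<String, Object>singletonMap(
                ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
        .commitOffsetsInFinalize());

    p.run().waitUntilFinish();
  }
}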
At that point, if redistribute is enabled, does it make more sense to replace
the Reshuffle here with the Redistribute transform, rather than inserting the
Redistribute after the Map? (Inserting it after the Map would introduce a
second shuffle, depending on the runner implementation.) A sketch of the
substitution option follows below.
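
To make the comparison concrete, here is a minimal sketch of the substitution
option using the standalone Redistribute transform; the String/Long elements
and the "CommitBranch"/"Output" steps are stand-ins for the KafkaIO internals,
not the actual expansion code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RedistributePlacementSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the ReadFromKafkaDoFn output (descriptor -> offset).
    PCollection<KV<String, Long>> records =
        p.apply(Create.of(KV.of("topic-0/partition-0", 42L)));

    // Redistribute takes the place of the Reshuffle, so both the commit
    // branch and the record output hang off the redistributed collection.
    PCollection<KV<String, Long>> stable =
        records.apply("Redistribute", Redistribute.<KV<String, Long>>arbitrarily());

    stable.apply("CommitBranch",  // stands in for KafkaCommitOffset
        MapElements.into(TypeDescriptors.longs())
            .via((KV<String, Long> kv) -> kv.getValue()));

    stable.apply("Output",        // stands in for Map(output KafkaRecord)
        MapElements.into(TypeDescriptors.longs())
            .via((KV<String, Long> kv) -> kv.getValue()));

    p.run().waitUntilFinish();
  }
}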
If we go with the substitution approach, commits will still occur, but the
offset commits themselves can contain duplicates (we need to investigate what
that can cause, or whether it is just a no-op if Beam attempts to commit the
same offset twice internally).
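
On the no-op question: as far as I understand, the Kafka broker keeps a single
committed offset per (group, partition) and last write wins, so committing the
identical offset twice should be harmless; the riskier duplicate would be a
stale commit that lands after a newer one and moves the stored offset
backwards. A minimal sketch at the plain Kafka client level (broker address,
topic, and group id are placeholders):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DuplicateCommitSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "beam-consumer-group");  // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder
      Map<TopicPartition, OffsetAndMetadata> offsets =
          Collections.singletonMap(tp, new OffsetAndMetadata(100L));

      // The broker stores one committed offset per (group, partition) and
      // overwrites it on each commit, so re-committing the same offset is
      // effectively a no-op.
      consumer.commitSync(offsets);
      consumer.commitSync(offsets);
    }
  }
}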