scwhittle commented on code in PR #32344:
URL: https://github.com/apache/beam/pull/32344#discussion_r1756577998
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java:
##########
@@ -2654,6 +2659,13 @@ public PCollection<KafkaRecord<K, V>> expand(PCollection<KafkaSourceDescriptor>
if (getRedistributeNumKeys() == 0) {
LOG.warn("This will create a key per record, which is sub-optimal for most use cases.");
}
+ // is another check here needed for with commit offsets
+ if (isCommitOffsetEnabled() || configuredKafkaCommit()) {
Review Comment:
Kafka autocommit can be ahead of the records actually processed, because the
autocommit happens when the consumer reads the data, but that read data may be
dropped if the commit to the Dataflow backend fails. Dataflow won't lose data
in this case, since it will reread from the previous offset that it stores
internally, but the autocommitted offset will be further ahead than that
offset.
That is true in all cases, but if the pipeline is drained at that point, the
commitOffsetEnabled path would eventually be correct, matching what was
actually processed by the pipeline, while the configuredKafkaCommit path would
remain incorrect.
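
For context, a minimal sketch of the two commit paths being contrasted here.
Broker/topic names are placeholders, and I'm assuming isCommitOffsetEnabled()
and configuredKafkaCommit() correspond to commitOffsetsInFinalize() and the
consumer's enable.auto.commit setting, respectively:

```java
import java.util.Collections;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

class CommitModeExamples {
  // Beam-managed commit: offsets are committed back to Kafka only after the
  // checkpoint/bundle containing the records is finalized by the runner, so
  // they do not run ahead of what was actually processed.
  static KafkaIO.Read<Long, String> beamManagedCommit() {
    return KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my_topic")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .commitOffsetsInFinalize();
  }

  // Kafka-client autocommit: the consumer commits on its own schedule as soon
  // as records are fetched, which can run ahead of what the pipeline has
  // durably processed.
  static KafkaIO.Read<Long, String> kafkaAutocommit() {
    return KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my_topic")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(
            Collections.<String, Object>singletonMap(
                ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true));
  }
}
```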
Separately, should this log only be emitted if redistribute is used with
duplicates allowed? If not, the redistribute is just like a reshuffle, which
has normal semantics.
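
To make that concrete, a rough sketch of gating the warning on the
allow-duplicates setting; everything other than the accessors already shown in
the diff (getRedistributeNumKeys, isCommitOffsetEnabled, configuredKafkaCommit)
is a hypothetical placeholder, not the actual KafkaIO code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch only: emit the commit-offset warning only when duplicates are allowed. */
class RedistributeCommitWarningSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(RedistributeCommitWarningSketch.class);

  void maybeWarn(
      int redistributeNumKeys,
      boolean allowDuplicates,
      boolean commitOffsetEnabled,
      boolean configuredKafkaCommit) {
    if (redistributeNumKeys == 0) {
      LOG.warn("This will create a key per record, which is sub-optimal for most use cases.");
    }
    // Only warn about offset commits when duplicates are allowed; without duplicates the
    // redistribute behaves like a reshuffle and the usual commit semantics hold.
    if (allowDuplicates && (commitOffsetEnabled || configuredKafkaCommit)) {
      LOG.warn(
          "Committed offsets (via commitOffsetsInFinalize() or the consumer's"
              + " enable.auto.commit) may not correspond to the records actually processed"
              + " when redistribute allows duplicates.");
    }
  }
}
```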
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]