scwhittle commented on code in PR #32344:
URL: https://github.com/apache/beam/pull/32344#discussion_r1756577998
##########
sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java:
##########
@@ -2654,6 +2659,13 @@ public PCollection<KafkaRecord<K, V>> expand(PCollection<KafkaSourceDescriptor>
if (getRedistributeNumKeys() == 0) {
LOG.warn("This will create a key per record, which is sub-optimal for most use cases.");
}
+ // is another check here needed for with commit offsets
+ if (isCommitOffsetEnabled() || configuredKafkaCommit()) {
Review Comment:
Kafka autocommit can be ahead of the records actually processed, because the
autocommit happens when the consumer reads the data, but that read data may be
dropped if the commit to the Dataflow backend fails. Dataflow won't lose data
in this case, since it will reread from the previous offset that it stores
internally, but the autocommitted offset will be further ahead than that
offset.
That is true in all cases, but if the pipeline is drained at that point, the
commitOffsetEnabled path would eventually be correct, matching what was
actually processed by the pipeline, while the configuredKafkaCommit path would
remain incorrect.
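
For context, a minimal sketch of the two commit paths being contrasted here.
Broker/topic names are placeholders, and I'm assuming isCommitOffsetEnabled()
and configuredKafkaCommit() correspond to commitOffsetsInFinalize() and the
consumer's enable.auto.commit setting, respectively:

```java
import java.util.Collections;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

class CommitModeExamples {
  // Beam-managed commit: offsets are committed back to Kafka only after the
  // checkpoint/bundle containing the records is finalized by the runner, so
  // they do not run ahead of what was actually processed.
  static KafkaIO.Read<Long, String> beamManagedCommit() {
    return KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my_topic")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .commitOffsetsInFinalize();
  }

  // Kafka-client autocommit: the consumer commits on its own schedule as soon
  // as records are fetched, which can run ahead of what the pipeline has
  // durably processed.
  static KafkaIO.Read<Long, String> kafkaAutocommit() {
    return KafkaIO.<Long, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("my_topic")
        .withKeyDeserializer(LongDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(
            Collections.<String, Object>singletonMap(
                ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, true));
  }
}
```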
Separately, should this log only be emitted if redistribute is used with
duplicates allowed? If not, the redistribute is just like a reshuffle, which
has normal semantics.
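
To make that concrete, a rough sketch of gating the warning on the
allow-duplicates setting; everything other than the accessors already shown in
the diff (getRedistributeNumKeys, isCommitOffsetEnabled, configuredKafkaCommit)
is a hypothetical placeholder, not the actual KafkaIO code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Sketch only: emit the commit-offset warning only when duplicates are allowed. */
class RedistributeCommitWarningSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(RedistributeCommitWarningSketch.class);

  void maybeWarn(
      int redistributeNumKeys,
      boolean allowDuplicates,
      boolean commitOffsetEnabled,
      boolean configuredKafkaCommit) {
    if (redistributeNumKeys == 0) {
      LOG.warn("This will create a key per record, which is sub-optimal for most use cases.");
    }
    // Only warn about offset commits when duplicates are allowed; without duplicates the
    // redistribute behaves like a reshuffle and the usual commit semantics hold.
    if (allowDuplicates && (commitOffsetEnabled || configuredKafkaCommit)) {
      LOG.warn(
          "Committed offsets (via commitOffsetsInFinalize() or the consumer's"
              + " enable.auto.commit) may not correspond to the records actually processed"
              + " when redistribute allows duplicates.");
    }
  }
}
```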
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]