Naireen commented on PR #32344:
URL: https://github.com/apache/beam/pull/32344#issuecomment-2316002328
> One other benefit of the committed offsets is that it aids customer
> visibility into the progress of partitions, since that can be queried or
> displayed externally to Dataflow.
That is a good point, and an argument for still allowing it. What I don't
understand, though, is how reshuffling/redistributing interacts with committing
offsets. Here is the current graph with offset commits enabled:
PCollection<KafkaSourceDescriptor>
  --> ParDo(ReadFromKafkaDoFn<KafkaSourceDescriptor,
                              KV<KafkaSourceDescriptor, KafkaRecord>>)
  --> Reshuffle() --> Map(output KafkaRecord)
        |
        --> KafkaCommitOffset
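
For reference, a minimal user-level sketch that triggers this expansion; the
broker address, topic, and group id are placeholders:

import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CommitOffsetsExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // commitOffsetsInFinalize() is what adds the KafkaCommitOffset branch
    // shown in the graph above; it requires a consumer group id.
    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")  // placeholder address
        .withTopic("my-topic")                // placeholder topic
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(
            Collections.<String, Object>singletonMap(
                ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
        .commitOffsetsInFinalize());

    p.run().waitUntilFinish();
  }
}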
At that point, if redistribute is enabled, does it make more sense to replace
the Reshuffle here with the Redistribute transform, rather than inserting the
Redistribute after the Map? (Inserting it after the Map would introduce a
second shuffle, depending on the runner implementation.) A sketch of the
substitution option follows below.
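
To make the comparison concrete, here is a minimal sketch of the substitution
option using the standalone Redistribute transform; the String/Long elements
and the "CommitBranch"/"Output" steps are stand-ins for the KafkaIO internals,
not the actual expansion code:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Redistribute;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class RedistributePlacementSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the ReadFromKafkaDoFn output (descriptor -> offset).
    PCollection<KV<String, Long>> records =
        p.apply(Create.of(KV.of("topic-0/partition-0", 42L)));

    // Redistribute takes the place of the Reshuffle, so both the commit
    // branch and the record output hang off the redistributed collection.
    PCollection<KV<String, Long>> stable =
        records.apply("Redistribute", Redistribute.<KV<String, Long>>arbitrarily());

    stable.apply("CommitBranch",  // stands in for KafkaCommitOffset
        MapElements.into(TypeDescriptors.longs())
            .via((KV<String, Long> kv) -> kv.getValue()));

    stable.apply("Output",        // stands in for Map(output KafkaRecord)
        MapElements.into(TypeDescriptors.longs())
            .via((KV<String, Long> kv) -> kv.getValue()));

    p.run().waitUntilFinish();
  }
}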
If we go with the substitution approach, commits will still occur, but the
offset commits themselves can contain duplicates (we need to investigate what
that can cause, or whether it is just a no-op if Beam attempts to commit the
same offset twice internally).
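
On the no-op question: as far as I understand, the Kafka broker keeps a single
committed offset per (group, partition) and last write wins, so committing the
identical offset twice should be harmless; the riskier duplicate would be a
stale commit that lands after a newer one and moves the stored offset
backwards. A minimal sketch at the plain Kafka client level (broker address,
topic, and group id are placeholders):

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DuplicateCommitSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "beam-consumer-group");  // placeholder
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
        StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
        StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder
      Map<TopicPartition, OffsetAndMetadata> offsets =
          Collections.singletonMap(tp, new OffsetAndMetadata(100L));

      // The broker stores one committed offset per (group, partition) and
      // overwrites it on each commit, so re-committing the same offset is
      // effectively a no-op.
      consumer.commitSync(offsets);
      consumer.commitSync(offsets);
    }
  }
}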