[GitHub] [spark] wecharyu commented on pull request #38898: [SPARK-41375][SS] Avoid empty latest KafkaSourceOffset

GitBox Wed, 07 Dec 2022 19:39:29 -0800


wecharyu commented on PR #38898:
URL: https://github.com/apache/spark/pull/38898#issuecomment-1341944936


   @jerrypeng the empty offset is stored in `committedOffsets`. When the next 
batch runs, the following code records an empty map as the start offset in 
`newBatchesPlan`:
   
https://github.com/apache/spark/blob/2179955b1e7fe9ed7ea44a4e0d794694d0f7133e/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L665
   Then, while fetching partitions, all partitions are treated as "new 
partitions" and are read from their earliest offsets, which produces duplicate 
data.
   
https://github.com/apache/spark/blob/2179955b1e7fe9ed7ea44a4e0d794694d0f7133e/connector/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetReaderAdmin.scala#L443-L460
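   To illustrate the failure mode described above, here is a minimal sketch in 
Python. It is a simplified model, not the actual Spark code: the function 
`resolve_start_offsets`, the topic name `events`, and all offset values are 
hypothetical. It only assumes that a partition missing from the start-offset 
map is treated as newly discovered and read from its earliest offset.

```python
def resolve_start_offsets(start_offsets, current_partitions, earliest_offsets):
    """Pick the offset to start reading from for each current partition.

    Simplified, hypothetical model: a partition that is absent from
    start_offsets is treated as "new" and starts at its earliest offset.
    """
    resolved = {}
    for tp in current_partitions:
        if tp in start_offsets:
            # Known partition: continue from the committed offset.
            resolved[tp] = start_offsets[tp]
        else:
            # Unknown partition: assumed new, so read from the beginning.
            resolved[tp] = earliest_offsets[tp]
    return resolved


# Hypothetical (topic, partition) pairs and offsets, for illustration only.
partitions = [("events", 0), ("events", 1)]
earliest = {("events", 0): 0, ("events", 1): 0}

# Normal case: the committed offsets cover every partition.
committed = {("events", 0): 120, ("events", 1): 95}
normal = resolve_start_offsets(committed, partitions, earliest)

# Problem case: an empty committed offset makes every partition look new,
# so already-processed records would be read again.
replayed = resolve_start_offsets({}, partitions, earliest)
```

   In this model, `normal` continues from the committed offsets, while 
`replayed` falls back to the earliest offset for every partition, which is 
the duplicate-data scenario.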


