Github user zsxwing commented on a diff in the pull request:
https://github.com/apache/spark/pull/22042#discussion_r212033844
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
---
@@ -337,6 +338,7 @@ private[kafka010] case class
KafkaMicroBatchInputPartitionReader(
val record = consumer.get(nextOffset, rangeToRead.untilOffset,
pollTimeoutMs, failOnDataLoss)
if (record != null) {
nextRow = converter.toUnsafeRow(record)
+ nextOffset = record.offset + 1
--- End diff --
We should update `nextOffset` to `record.offset + 1` rather than
`nextOffset + 1`. Otherwise, it may return duplicated records when
`failOnDataLoss` is `false`. I will submit another PR to push this fix to 2.3
as it's a correctness issue.
In addition, we should update `nextOffset` in the `next` method instead,
since the `get` method is designed to be called multiple times.
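To illustrate the duplicate-record issue, here is a minimal standalone sketch (hypothetical names, not Spark's actual classes): when `failOnDataLoss` is `false`, the consumer may skip a lost offset range, so the returned record's offset can be larger than the requested offset. Advancing with `record.offset + 1` jumps past the gap, while `nextOffset + 1` would re-request an offset inside the gap and fetch the same record again.

```scala
// Toy model of the reader loop. `Record`, `log`, and `get` are simplified
// stand-ins for Kafka's ConsumerRecord and KafkaDataConsumer.get.
case class Record(offset: Long, value: String)

// A log with a gap: offsets 5 and 6 were deleted (simulated data loss).
val log = Map(
  3L -> Record(3, "a"),
  4L -> Record(4, "b"),
  7L -> Record(7, "c"),
  8L -> Record(8, "d"))

// With failOnDataLoss = false, `get` returns the first record at or after
// the requested offset, skipping the lost range instead of failing.
def get(requested: Long, until: Long): Record =
  (requested until until).iterator.flatMap(log.get).next()

var nextOffset = 3L
val seen = scala.collection.mutable.Buffer[Long]()
while (nextOffset < 9L && log.keys.exists(_ >= nextOffset)) {
  val record = get(nextOffset, 9L)
  seen += record.offset
  nextOffset = record.offset + 1  // correct: jump past the lost range
  // With `nextOffset += 1` instead, the loop would request offset 5,
  // receive record 7 again, and emit a duplicate.
}
// seen contains each offset exactly once: 3, 4, 7, 8
```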
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]