Github user zsxwing commented on a diff in the pull request:
https://github.com/apache/spark/pull/22042#discussion_r212033844
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
---
@@ -337,6 +338,7 @@ private[kafka010] case class
KafkaMicroBatchInputPartitionReader(
val record = consumer.get(nextOffset, rangeToRead.untilOffset,
pollTimeoutMs, failOnDataLoss)
if (record != null) {
nextRow = converter.toUnsafeRow(record)
+ nextOffset = record.offset + 1
--- End diff --
We should update `nextOffset` to `record.offset + 1` rather than
`nextOffset + 1`. Otherwise, it may return duplicated records when
`failOnDataLoss` is `false`. I will submit another PR to push this fix to 2.3
as it's a correctness issue.
In addition, we should update `nextOffset` in the `next` method instead,
since the `get` method is designed to be called multiple times.
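To illustrate the duplicate-record issue, here is a minimal standalone sketch (hypothetical names, not Spark's actual classes): when `failOnDataLoss` is `false`, the consumer may skip a lost offset range, so the returned record's offset can be larger than the requested offset. Advancing with `record.offset + 1` jumps past the gap, while `nextOffset + 1` would re-request an offset inside the gap and fetch the same record again.

```scala
// Toy model of the reader loop. `Record`, `log`, and `get` are simplified
// stand-ins for Kafka's ConsumerRecord and KafkaDataConsumer.get.
case class Record(offset: Long, value: String)

// A log with a gap: offsets 5 and 6 were deleted (simulated data loss).
val log = Map(
  3L -> Record(3, "a"),
  4L -> Record(4, "b"),
  7L -> Record(7, "c"),
  8L -> Record(8, "d"))

// With failOnDataLoss = false, `get` returns the first record at or after
// the requested offset, skipping the lost range instead of failing.
def get(requested: Long, until: Long): Record =
  (requested until until).iterator.flatMap(log.get).next()

var nextOffset = 3L
val seen = scala.collection.mutable.Buffer[Long]()
while (nextOffset < 9L && log.keys.exists(_ >= nextOffset)) {
  val record = get(nextOffset, 9L)
  seen += record.offset
  nextOffset = record.offset + 1  // correct: jump past the lost range
  // With `nextOffset += 1` instead, the loop would request offset 5,
  // receive record 7 again, and emit a duplicate.
}
// seen contains each offset exactly once: 3, 4, 7, 8
```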
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]