Github user QuentinAmbard commented on the issue:

    https://github.com/apache/spark/pull/21917
  
    If you do it in advance you change the range, so for example you read 
until 3 and don't get any extra results. Maybe it's because of a transaction 
offset, maybe it's another issue; it's OK in both cases.
    The big difference is that the next batch will restart from offset 3 and 
poll from this value. If seeking to 3 and polling gets you another result (for 
example 6), then everything is fine: it's not data loss, it's just a gap.
    The issue with your proposal is that SeekToEnd gives you the last offset, 
which might not be the last record.
    So in your example, if the last offset is 5 and after a few polls the last 
record you get is 3, what do you do? Continue and execute the next batch from 
5? How do you know that offset 4 isn't just lost because poll failed?
    The only way to know would be to get a record with an offset higher 
than 5. In that case you know it's just a gap.
    But if the message you are reading is the last of the topic, you won't 
have records higher than 3, so you can't tell if it's a poll failure or an 
empty offset because of the transaction commit.
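
    To make the three cases concrete, here is a toy sketch in Python. It does 
not use Kafka or Spark at all; `classify_tail` and its parameters are 
hypothetical names, and it assumes the end offset is exclusive (one past the 
last possible record), which may not match how the numbers are counted above.

```python
from typing import Optional


def classify_tail(last_polled: int, end_offset: int,
                  next_polled: Optional[int]) -> str:
    """Classify the offsets between the last record received and end_offset.

    last_polled: offset of the last record poll actually returned.
    end_offset:  end of the range computed in advance (assumed exclusive).
    next_polled: offset of the first record returned by a later poll,
                 or None if nothing more came back.
    """
    if last_polled >= end_offset - 1:
        # Every offset in the range was delivered as a record.
        return "complete"
    if next_polled is not None and next_polled > last_polled:
        # A later record arrived, so the skipped offsets hold no data
        # (for example transaction markers): a gap, not a loss.
        return "gap"
    # Nothing after last_polled: an empty offset and a failed poll
    # look exactly the same from here.
    return "ambiguous"


print(classify_tail(3, 5, 6))     # later record seen
print(classify_tail(3, 5, None))  # end of topic, nothing after 3
```

    The second call is the problematic case: with nothing after offset 3, the 
sketch has no information that separates a transaction gap from a failed poll.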

