Github user QuentinAmbard commented on the issue:
https://github.com/apache/spark/pull/21917
If you do it in advance you change the range. For example, you read until
offset 3 and don't get any further records. Maybe that's because of a
transaction offset, maybe it's another issue; it's OK in both cases.
The big difference is that the next batch will restart from offset 3 and
poll from there. If seeking to 3 and polling gets you another record (for
example offset 6), then everything is fine: it's not data loss, it's just a gap.
The issue with your proposal is that SeekToEnd gives you the last offset,
which might not be the offset of the last record.
So in your example, if the last offset is 5 and after a few polls the last
record you get is at offset 3, what do you do? Continue and execute the next
batch from 5? How do you know that offset 4 isn't just lost because a poll
failed?
The only way to know that would be to get a record with an offset higher
than 5. In this case you know it's just a gap.
But if the message you are reading is the last one in the topic, you won't
get any records with an offset higher than 3, so you can't tell whether it's
a poll failure or an empty offset caused by the transaction commit.
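
To make the cases above concrete, here is a minimal sketch in plain Python.
It is only a model of the reasoning, not Spark or Kafka code: the function
name `classify` and its parameters are hypothetical, and it assumes the end
offset is exclusive (the position where the next record would be written).

```python
from typing import Optional


def classify(last_seen: int, end_offset: int, next_offset: Optional[int]) -> str:
    """Classify what we know after polling past `last_seen`.

    last_seen   -- offset of the last record actually received
    end_offset  -- offset reported by seeking to the end (assumed exclusive)
    next_offset -- offset of the first record returned after `last_seen`,
                   or None if polling returned nothing
    """
    if next_offset is not None:
        if next_offset <= last_seen:
            raise ValueError("next_offset must be greater than last_seen")
        if next_offset == last_seen + 1:
            # The very next offset arrived, nothing is missing.
            return "contiguous"
        # A later record arrived, so the skipped offsets never held
        # consumable records: a gap, not data loss.
        return "gap"
    if last_seen + 1 >= end_offset:
        # Nothing more was expected before the end offset.
        return "complete"
    # Offsets are missing before the end and no later record proves
    # they are a gap: poll failure and empty offsets look the same.
    return "ambiguous"
```

With the numbers from the example, `classify(3, 5, 6)` returns "gap", while
`classify(3, 5, None)` returns "ambiguous": that second case is the one where
you can't decide whether it is safe to start the next batch from 5.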