GitHub user koeninger commented on the issue:
https://github.com/apache/spark/pull/21917
> How do you know that offset 4 isn't just lost because poll failed?

By failed, do you mean poll returned an empty collection after timing out,
even though records should be available? You don't know. You also don't know
that it isn't lost because Kafka skipped a message. AFAIK, with only the
information a Kafka consumer gives you, once you start allowing gaps in
offsets you can't tell those cases apart.
I understand your point, but even under your proposal there is no guarantee
that a poll that succeeds in the first pass, during RDD construction on the
driver, won't then fail on the executor during computation, right?
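
To make the ambiguity concrete, here is a minimal sketch (broker address,
topic, and the offset are hypothetical) of why an empty poll result can't
tell you whether an offset is a genuine gap:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("enable.auto.commit", "false")

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val tp = new TopicPartition("some-topic", 0) // hypothetical topic
consumer.assign(Collections.singletonList(tp))
consumer.seek(tp, 4L) // try to read offset 4

val records = consumer.poll(Duration.ofMillis(512)).records(tp)
// An empty result is ambiguous: either the poll timed out before the fetch
// completed, or offset 4 genuinely doesn't exist (compaction, transaction
// marker, aborted record). The consumer API doesn't say which.
if (records.isEmpty) println("empty poll: timeout, or a real gap?")
consumer.close()
```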
> The issue with your proposal is that SeekToEnd gives you the last offset
which might not be the last record.
Have you tested comparing the results of consumer.endOffsets for consumers
with different isolation levels?
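
A rough sketch of what I mean (broker and topic are hypothetical;
isolation.level and endOffsets are the actual consumer config key and API
method):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def endOffsetFor(isolationLevel: String, tp: TopicPartition): Long = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // hypothetical broker
  props.put("isolation.level", isolationLevel)
  props.put("key.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  props.put("value.deserializer",
    "org.apache.kafka.common.serialization.ByteArrayDeserializer")
  val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
  try consumer.endOffsets(Collections.singletonList(tp)).get(tp)
  finally consumer.close()
}

val tp = new TopicPartition("some-topic", 0) // hypothetical topic
// read_committed should report the last stable offset, which stops at the
// first open transaction; read_uncommitted reports the log end offset.
println(s"read_uncommitted: ${endOffsetFor("read_uncommitted", tp)}")
println(s"read_committed:   ${endOffsetFor("read_committed", tp)}")
```

If those two disagree on a topic with in-flight transactions, that tells us
whether endOffsets is safe to treat as "last record + 1".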
Your proposal might end up being the best approach anyway, just because of
the unfortunate effect gaps have on StreamInputInfo and count, but I want to
make sure we think this through.
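
If I'm reading the StreamInputInfo concern right, it reduces to counts being
derived from offset ranges rather than from actual records; a toy
illustration with hypothetical numbers:

```scala
// A batch covering offsets [0, 10).
val fromOffset = 0L
val untilOffset = 10L
// A count derived from the offset range alone...
val reportedCount = untilOffset - fromOffset // 10
// ...overstates the batch if offset 4 is a transaction marker or was
// compacted away, since only 9 records actually come back.
```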