Github user jwbear commented on the issue:
https://github.com/apache/spark/pull/15102
Just curious looking at this: if you are comparing "sequential" offsets
across partitions, a rebalance would definitely affect this, and unless
something has changed, it is probably not a good idea to compare Kafka
offsets across partitions. Instead of this methodology, you could simply
add an id/timestamp at the producer and send it with the message, or, if
you must use offsets, query the broker for the full list and compare what
you consumed against that list (at the cost of a small increase in latency
between consumption and processing).

This passage from the Kafka paper makes me question your scheme: "...Note
that our message ids are increasing but not consecutive. To compute the id
of the next message, we have to add the length of the current message to
its id." This means that simply comparing which offset is larger will not
necessarily yield the most recent message across partitions, and it
definitely won't hold during a rebalance, when some broker logs will be on
hold and not consumed.

In my own implementation, offsets are great for message guarantees (e.g.
delivery/consumption checks), because the broker has a fully ordered list,
but not for cross-partition ordering.
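A minimal sketch of the two points above, in plain Python with no Kafka client involved. The helper names (`tag_message`, the `(ts, seq)` tag) are illustrative assumptions, not part of any Kafka API; the idea is just that a producer-side tag gives a cross-partition ordering that per-partition offsets cannot.

```python
import time

# The Kafka paper's point: a message id is a byte position in the log, so
# the next id is the current id plus the current message's length --
# ids increase, but they are not consecutive.
msg = b"hello"
current_id = 100
next_id = current_id + len(msg)   # 105, not 101

# Hypothetical producer-side tagging, as suggested above: attach a
# monotonically increasing (timestamp, sequence) pair to each message at
# the producer, and order consumed records by that tag instead of by
# offset.
_seq = 0

def tag_message(payload):
    """Attach a producer-side (timestamp, sequence) tag to a payload."""
    global _seq
    _seq += 1
    return {"ts": time.time(), "seq": _seq, "payload": payload}

# Simulated records consumed from two partitions. Offsets are
# per-partition, so partition 1's offset 7 says nothing about how its
# record relates in time to partition 0's offset 3.
partition0 = [(3, tag_message("a"))]   # (offset, record)
partition1 = [(7, tag_message("b"))]

# Cross-partition ordering by the producer tag, not by offset:
records = [r for _, r in partition0 + partition1]
ordered = sorted(records, key=lambda r: (r["ts"], r["seq"]))
```

Here "a" was produced before "b", so it sorts first even though its offset (3) is smaller than "b"'s (7) in a different partition; comparing the raw offsets across the two partitions would tell you nothing either way.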