Github user jwbear commented on the issue:
https://github.com/apache/spark/pull/15102
Just curious looking at this: if you are comparing "sequential" offsets
across partitions, a rebalance would definitely affect this, and unless
something has changed, it is probably not a good idea to compare Kafka
offsets across partitions. Instead of this methodology, you could simply
add an id/timestamp at the producer and send it with the message, or, if
you must use offsets, query the broker for the full list and compare what
you consumed against that list (at the cost of a small increase in latency
between consumption and processing).

This passage from the Kafka paper makes me question your scheme: "...Note
that our message ids are increasing but not consecutive. To compute the id
of the next message, we have to add the length of the current message to
its id." This means that simply comparing which offset is larger will not
necessarily yield the most recent message across partitions, and it
definitely won't hold during a rebalance, when some broker logs will be on
hold and not consumed.

In my own implementation, offsets are great for message guarantees (e.g.
delivery/consumption checks), because the broker has a fully ordered list,
but not for cross-partition ordering.
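A minimal sketch of the two points above, in plain Python with no Kafka client involved. The helper names (`tag_message`, the `(ts, seq)` tag) are illustrative assumptions, not part of any Kafka API; the idea is just that a producer-side tag gives a cross-partition ordering that per-partition offsets cannot.

```python
import time

# The Kafka paper's point: a message id is a byte position in the log, so
# the next id is the current id plus the current message's length --
# ids increase, but they are not consecutive.
msg = b"hello"
current_id = 100
next_id = current_id + len(msg)   # 105, not 101

# Hypothetical producer-side tagging, as suggested above: attach a
# monotonically increasing (timestamp, sequence) pair to each message at
# the producer, and order consumed records by that tag instead of by
# offset.
_seq = 0

def tag_message(payload):
    """Attach a producer-side (timestamp, sequence) tag to a payload."""
    global _seq
    _seq += 1
    return {"ts": time.time(), "seq": _seq, "payload": payload}

# Simulated records consumed from two partitions. Offsets are
# per-partition, so partition 1's offset 7 says nothing about how its
# record relates in time to partition 0's offset 3.
partition0 = [(3, tag_message("a"))]   # (offset, record)
partition1 = [(7, tag_message("b"))]

# Cross-partition ordering by the producer tag, not by offset:
records = [r for _, r in partition0 + partition1]
ordered = sorted(records, key=lambda r: (r["ts"], r["seq"]))
```

Here "a" was produced before "b", so it sorts first even though its offset (3) is smaller than "b"'s (7) in a different partition; comparing the raw offsets across the two partitions would tell you nothing either way.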