Github user viirya commented on the pull request:
https://github.com/apache/spark/pull/10197#issuecomment-163313509
As I can see from the implementation, the reason direct stream has some
latency is because it is going to generate the rdd after each batch window
finishes. So it certainly introduces extra latency compared with receiver-based
KafkaInputDStream which continues to produce blocks that are formed the base of
rdd later. The latency will grow as you increase your batch duration as it
takes longer to generate the rdd.
The codes are mostly as same as receiver-based input dstream and
DirectKafkaInputDStream. Maybe we can refactor it later to share most of the
codes.
For multiple receivers, different receivers should handle different
topic-partitions.
You are right. It should not silently catche these exceptions. This should
be fixed. The checkpoint description should be removed too.
@koeninger Thanks for comments. I am not sure if this is useful to others.
But for our cases, we need a Kafka input dstream which has exactly once feature
and does not introduce latency as DirectKafkaInputDStream did. Observed on our
tests, it actually helps reduce the extra latency.
As it is very initial attempt, I will close it now and see if we can reopen
it later.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]