Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/10197#issuecomment-163313509
  
    As I can see from the implementation, the reason direct stream has some 
latency is because it is going to generate the rdd after each batch window 
finishes. So it certainly introduces extra latency compared with receiver-based 
KafkaInputDStream which continues to produce blocks that are formed the base of 
rdd later. The latency will grow as you increase your batch duration as it 
takes longer to generate the rdd.
    
    The codes are mostly as same as receiver-based input dstream and 
DirectKafkaInputDStream. Maybe we can refactor it later to share most of the 
codes.
    
    For multiple receivers, different receivers should handle different 
topic-partitions.
    
    You are right. It should not silently catche these exceptions. This should 
be fixed. The checkpoint description should be removed too.
    
    @koeninger  Thanks for comments. I am not sure if this is useful to others. 
But for our cases, we need a Kafka input dstream which has exactly once feature 
and does not introduce latency as DirectKafkaInputDStream did. Observed on our 
tests, it actually helps reduce the extra latency.
    
    As it is very initial attempt, I will close it now and see if we can reopen 
it later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to