Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/3798#issuecomment-68407841
@koeninger I like the high-level approach of the Kafka RDD and the DeterministicKafkaInputDStream. However, this is not tolerant to driver failures yet, for the following reason. If the driver fails and needs to be recovered from the checkpoints (by deserializing the DAG of DStreams), then the RDDs of previous batches need to be recreated (say, the batches over the previous 5 minutes, if 5-minute window operations are used). Since the offsets used in the previous batches are not serialized with the DStream, the corresponding RDDs cannot be recreated. This problem can be alleviated pretty easily by making the DStream save a map of Time --> offsets. See FileInputDStream to understand how this can be done. I can talk about this offline to explain it further.
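To make the idea concrete, here is a minimal, self-contained sketch of the Time --> offsets bookkeeping described above. Note that `Time`, `OffsetRange`, and `OffsetTrackingSketch` are simplified stand-ins for illustration, not Spark's actual classes; the real change would live inside the DStream itself and hook into its checkpoint data, the way FileInputDStream maintains its Time -> files map.

```scala
import scala.collection.mutable

// Stand-in types for illustration only (not Spark's real classes).
case class Time(milliseconds: Long)
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

class OffsetTrackingSketch extends Serializable {
  // Serialized along with the DStream, so a driver recovered from a
  // checkpoint can recreate the RDD of any previous batch from the
  // offsets that batch originally read.
  private val batchOffsets = mutable.HashMap.empty[Time, Seq[OffsetRange]]

  // Called once per batch: record the offset ranges this batch reads.
  def recordBatch(validTime: Time, offsets: Seq[OffsetRange]): Seq[OffsetRange] = {
    batchOffsets(validTime) = offsets
    offsets
  }

  // After driver recovery, look up the saved offsets for a past batch
  // instead of asking Kafka for the latest ones.
  def offsetsFor(time: Time): Option[Seq[OffsetRange]] = batchOffsets.get(time)

  // Forget entries older than the remember duration (e.g. the window
  // length), as FileInputDStream does for its Time -> files map.
  def clearOldBatches(currentTime: Time, rememberMs: Long): Unit =
    batchOffsets --= batchOffsets.keys
      .filter(_.milliseconds <= currentTime.milliseconds - rememberMs)
      .toList
}
```

Because the map travels with the checkpointed DStream, recreating the last N batches after recovery is just a lookup per batch time rather than a query against Kafka, which keeps the recomputed RDDs identical to the originals.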