Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/3798#issuecomment-68407841
@koeninger I like the high-level approach of the Kafka RDD and the DeterministicKafkaInputDStream. However, this is not tolerant to driver failures yet, for the following reason. If the driver fails and needs to be recovered from the checkpoints (by deserializing the DAG of DStreams), then the RDDs of previous batches need to be recreated (say, the batches over the previous 5 minutes, if 5-minute window operations are used). Since the offsets used in the previous batches are not serialized with the DStream, the corresponding RDDs cannot be recreated. This problem can be alleviated pretty easily by making the DStream save a map of Time --> offsets. See FileInputDStream to understand how this can be done. I can talk about this offline to explain it further.
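To make the idea concrete, here is a minimal, self-contained sketch of the Time --> offsets bookkeeping described above. Note that `Time`, `OffsetRange`, and `OffsetTrackingSketch` are simplified stand-ins for illustration, not Spark's actual classes; the real change would live inside the DStream itself and hook into its checkpoint data, the way FileInputDStream maintains its Time -> files map.

```scala
import scala.collection.mutable

// Stand-in types for illustration only (not Spark's real classes).
case class Time(milliseconds: Long)
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

class OffsetTrackingSketch extends Serializable {
  // Serialized along with the DStream, so a driver recovered from a
  // checkpoint can recreate the RDD of any previous batch from the
  // offsets that batch originally read.
  private val batchOffsets = mutable.HashMap.empty[Time, Seq[OffsetRange]]

  // Called once per batch: record the offset ranges this batch reads.
  def recordBatch(validTime: Time, offsets: Seq[OffsetRange]): Seq[OffsetRange] = {
    batchOffsets(validTime) = offsets
    offsets
  }

  // After driver recovery, look up the saved offsets for a past batch
  // instead of asking Kafka for the latest ones.
  def offsetsFor(time: Time): Option[Seq[OffsetRange]] = batchOffsets.get(time)

  // Forget entries older than the remember duration (e.g. the window
  // length), as FileInputDStream does for its Time -> files map.
  def clearOldBatches(currentTime: Time, rememberMs: Long): Unit =
    batchOffsets --= batchOffsets.keys
      .filter(_.milliseconds <= currentTime.milliseconds - rememberMs)
      .toList
}
```

Because the map travels with the checkpointed DStream, recreating the last N batches after recovery is just a lookup per batch time rather than a query against Kafka, which keeps the recomputed RDDs identical to the originals.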