Github user koeninger commented on the pull request:
https://github.com/apache/spark/pull/3798#issuecomment-68149432
Hi @jerryshao
I'd politely ask that anyone with questions read at least KafkaRDD.scala
and the example usage linked from the JIRA ticket (it's only about 50
significant lines of code):
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalExample.scala
I'll try to address your points.
1. Yes, each RDD partition maps directly to a Kafka (topic, partition,
inclusive starting offset, exclusive ending offset) tuple.
2. It's a pull model, not a receiver push model. All the InputDStream
implementation does is check the leaders' highest offsets and define
an RDD based on that. When the RDD is run, its iterator makes a connection to
Kafka and pulls the data. This is done because it's simpler, and because using
the existing network receiver code would require dedicating one core per Kafka
partition, which is unacceptable from an ops standpoint.
3. Yes. The fault tolerance model is that it should be safe for any or all
of the Spark machines to be completely destroyed at any point in the job, and
the job should be able to be safely restarted. I don't think you can do better
than this. This is achieved because all important state, especially the
storage of offsets, is controlled by client code, not Spark. In both the
transactional and idempotent client code approaches, offsets aren't stored
until the data is stored, so restart should be safe.
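To make points 1 and 3 concrete, here is a minimal pure-Scala sketch. The names (`OffsetRange`, `AtomicStore`) are illustrative stand-ins, not the actual KafkaRDD internals: a partition is just an offset range, and client code commits records and offsets in one atomic step, so a crash before the commit simply replays from the last stored offset.

```scala
// Illustrative sketch only -- not the actual KafkaRDD implementation.
// Point 1: an RDD partition corresponds to a (topic, partition, from, until) range.
final case class OffsetRange(
  topic: String,
  partition: Int,
  fromOffset: Long,   // inclusive starting offset
  untilOffset: Long)  // exclusive ending offset

// Point 3: data and offsets are stored together, so a restart resumes
// from the last committed offset with no loss and no duplicates.
object AtomicStore {
  private val sink = scala.collection.mutable.ArrayBuffer.empty[String]
  private var offset: Long = 0L

  def committedOffset: Long = offset
  def storedRecords: Seq[String] = sink.toSeq

  // Simulated transaction: refuse batches that would skip or replay data,
  // then store the records and advance the offset as a single step.
  def commit(records: Seq[String], range: OffsetRange): Unit = {
    require(range.fromOffset == offset,
      s"expected offset $offset, got ${range.fromOffset}")
    require(range.untilOffset - range.fromOffset == records.size,
      "range/batch size mismatch")
    sink ++= records
    offset = range.untilOffset
  }
}
```

On restart, the job reads the committed offset back from its own store and defines the next batch's range starting there, which is why even destroying every Spark machine mid-job is recoverable.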
Regarding the approach you linked, the problem there is (a) it's not part
of the Spark distribution, so people won't know about it, and (b) it
assumes control of Kafka offsets and their storage in ZooKeeper, which makes it
impossible for client code to control exactly-once semantics.
Regarding the possible semantic disconnect between Spark Streaming and
treating Kafka as a durable store of data from the past (assuming that's what
you meant)... I agree there is a disconnect there. But it's a fundamental
problem with Spark Streaming, in that it implicitly depends on "now" rather than
a time embedded in the data stream. I don't think we're fixing that with this
ticket.