Cody Koeninger created SPARK-4964:
-------------------------------------
Summary: Exactly-once semantics for Kafka
Key: SPARK-4964
URL: https://issues.apache.org/jira/browse/SPARK-4964
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Cody Koeninger
For background, see
http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
Requirements:
- allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where its output storage is either idempotent or transactional
- allow client code access to Kafka offsets, rather than automatically committing them
- do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data)
- allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated)
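To illustrate the last requirement, here is a rough sketch of how a driver could recover its starting offsets from the output store itself rather than from a checkpoint. The JDBC store and the kafka_offsets table are assumptions made up for the example, not part of the proposal:

{code:scala}
import java.sql.DriverManager

// Hypothetical table kafka_offsets(topic, part, until_offset) living in the same
// database the job writes its results to.  On restart, the driver reads the last
// committed offset for each topic/partition and resumes from there, so recovery
// does not depend on a Spark checkpoint being restorable.
def loadCommittedOffsets(jdbcUrl: String): Map[(String, Int), Long] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val rs = conn.createStatement().executeQuery(
      "SELECT topic, part, until_offset FROM kafka_offsets")
    val builder = Map.newBuilder[(String, Int), Long]
    while (rs.next()) {
      builder += ((rs.getString("topic"), rs.getInt("part")) -> rs.getLong("until_offset"))
    }
    builder.result()
  } finally {
    conn.close()
  }
}
{code}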
Design:
The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention).
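A minimal sketch of the per-partition descriptor such an RDD might use (the name OffsetRange and the exact fields are illustrative, not a final API):

{code:scala}
// Illustrative only: each partition of the proposed RDD would be fully described by a
// topic, a Kafka partition id, and a starting/ending offset.  Given the same four
// values, the same messages are read every time, which is what makes replay
// deterministic.
case class OffsetRange(
  topic: String,
  partition: Int,
  fromOffset: Long,   // first offset to read (inclusive)
  untilOffset: Long)  // offset to stop at (exclusive)
{code}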
Client code is responsible for committing offsets, either transactionally to the same store the data is being written to, or, in the case of idempotent data, after the data has been written.
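For the transactional case, the commit might look something like this sketch, where a batch's results and the offsets that produced them are written in a single database transaction (the results and kafka_offsets tables, and the single-partition simplification, are assumptions for the example):

{code:scala}
import java.sql.DriverManager

// Hypothetical exactly-once commit: a batch's results and the offsets that produced
// them are stored in one transaction.  If the transaction fails, neither is stored,
// and the same offset range can simply be replayed.
def commitBatch(jdbcUrl: String, topic: String, part: Int,
                untilOffset: Long, rows: Seq[String]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)
    val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
    rows.foreach { r =>
      insert.setString(1, r)
      insert.executeUpdate()
    }
    val offsets = conn.prepareStatement(
      "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
    offsets.setLong(1, untilOffset)
    offsets.setString(2, topic)
    offsets.setInt(3, part)
    offsets.executeUpdate()
    conn.commit()        // results and offsets become visible atomically
  } catch {
    case e: Exception =>
      conn.rollback()    // nothing stored; the offset range can be reprocessed
      throw e
  } finally {
    conn.close()
  }
}
{code}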
A PR with a sample implementation for both the batch and dstream cases is forthcoming.