Cody Koeninger created SPARK-4964:
-------------------------------------
Summary: Exactly-once semantics for Kafka
Key: SPARK-4964
URL: https://issues.apache.org/jira/browse/SPARK-4964
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Cody Koeninger
For background, see
http://apache-spark-developers-list.1001551.n3.nabble.com/Which-committers-care-about-Kafka-td9827.html
Requirements:
- allow client code to implement exactly-once end-to-end semantics for Kafka messages, in cases where its output storage is either idempotent or transactional
- allow client code access to Kafka offsets, rather than automatically committing them
- do not assume Zookeeper as a repository for offsets (for the transactional case, offsets need to be stored in the same store as the data)
- allow failure recovery without lost or duplicated messages, even in cases where a checkpoint cannot be restored (for instance, because code must be updated)
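To illustrate the last requirement, here is a rough sketch of how a driver could recover its starting offsets from the output store itself rather than from a checkpoint. The JDBC store and the kafka_offsets table are assumptions made up for the example, not part of the proposal:

{code:scala}
import java.sql.DriverManager

// Hypothetical table kafka_offsets(topic, part, until_offset) living in the same
// database the job writes its results to.  On restart, the driver reads the last
// committed offset for each topic/partition and resumes from there, so recovery
// does not depend on a Spark checkpoint being restorable.
def loadCommittedOffsets(jdbcUrl: String): Map[(String, Int), Long] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val rs = conn.createStatement().executeQuery(
      "SELECT topic, part, until_offset FROM kafka_offsets")
    val builder = Map.newBuilder[(String, Int), Long]
    while (rs.next()) {
      builder += ((rs.getString("topic"), rs.getInt("part")) -> rs.getLong("until_offset"))
    }
    builder.result()
  } finally {
    conn.close()
  }
}
{code}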
Design:
The basic idea is to make an RDD where each partition corresponds to a given Kafka topic, partition, starting offset, and ending offset. That allows for deterministic replay of data from Kafka (as long as there is enough log retention).
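A minimal sketch of the per-partition descriptor such an RDD might use (the name OffsetRange and the exact fields are illustrative, not a final API):

{code:scala}
// Illustrative only: each partition of the proposed RDD would be fully described by a
// topic, a Kafka partition id, and a starting/ending offset.  Given the same four
// values, the same messages are read every time, which is what makes replay
// deterministic.
case class OffsetRange(
  topic: String,
  partition: Int,
  fromOffset: Long,   // first offset to read (inclusive)
  untilOffset: Long)  // offset to stop at (exclusive)
{code}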
Client code is responsible for committing offsets, either transactionally to the same store the data is being written to, or, in the case of idempotent data, after the data has been written.
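For the transactional case, the commit might look something like this sketch, where a batch's results and the offsets that produced them are written in a single database transaction (the results and kafka_offsets tables, and the single-partition simplification, are assumptions for the example):

{code:scala}
import java.sql.DriverManager

// Hypothetical exactly-once commit: a batch's results and the offsets that produced
// them are stored in one transaction.  If the transaction fails, neither is stored,
// and the same offset range can simply be replayed.
def commitBatch(jdbcUrl: String, topic: String, part: Int,
                untilOffset: Long, rows: Seq[String]): Unit = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    conn.setAutoCommit(false)
    val insert = conn.prepareStatement("INSERT INTO results (value) VALUES (?)")
    rows.foreach { r =>
      insert.setString(1, r)
      insert.executeUpdate()
    }
    val offsets = conn.prepareStatement(
      "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
    offsets.setLong(1, untilOffset)
    offsets.setString(2, topic)
    offsets.setInt(3, part)
    offsets.executeUpdate()
    conn.commit()        // results and offsets become visible atomically
  } catch {
    case e: Exception =>
      conn.rollback()    // nothing stored; the offset range can be reprocessed
      throw e
  } finally {
    conn.close()
  }
}
{code}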
A PR with a sample implementation for both the batch and dstream cases is forthcoming.