Juan Rodríguez Hortalá created SPARK-6714:
---------------------------------------------

             Summary: additionally overload KafkaUtils.createDirectStream for 
using a messageHandler without having to specify the offsets
                 Key: SPARK-6714
                 URL: https://issues.apache.org/jira/browse/SPARK-6714
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 1.3.0
            Reporter: Juan Rodríguez Hortalá
            Priority: Trivial


Currently, in the Scala API, KafkaUtils.createDirectStream has two overload 
methods, one for an "easy mode" where only the Kafka parameters and topics are 
specified, and other "hard mode" where we can specify the offsets and a 
messageHandler for manipulating the MessageAndMetadata values obtained from 
Kafka. I think an intermediate method that automatically handles the offsets, 
but that allows you to specify the messageHandler would be very useful. For 
example the triple (topic, partition, offset) uniquely identifies each message, 
so that could be useful as primary key for idempotent insertions in an external 
database. Also, in an implementation of a lambda architecture, offsets could be 
used to trace which part of a kafka topic has been covered by the batch view, 
and which part by the real-time / live view. For both cases I think that having 
access to the meta information through a messageHandler, while maintaining 
managed offsets, would be interesting

A proposal for the implementation is available at 
https://github.com/juanrh/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala.
 A new overload of KafkaUtils.createDirectStream for the Scala API, and another 
for the Java API, have been added



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to