Juan Rodríguez Hortalá created SPARK-6714:
---------------------------------------------
Summary: additionally overload KafkaUtils.createDirectStream for
using a messageHandler without having to specify the offsets
Key: SPARK-6714
URL: https://issues.apache.org/jira/browse/SPARK-6714
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.3.0
Reporter: Juan Rodríguez Hortalá
Priority: Trivial
Currently, in the Scala API, KafkaUtils.createDirectStream has two overload
methods, one for an "easy mode" where only the Kafka parameters and topics are
specified, and other "hard mode" where we can specify the offsets and a
messageHandler for manipulating the MessageAndMetadata values obtained from
Kafka. I think an intermediate method that automatically handles the offsets,
but that allows you to specify the messageHandler would be very useful. For
example the triple (topic, partition, offset) uniquely identifies each message,
so that could be useful as primary key for idempotent insertions in an external
database. Also, in an implementation of a lambda architecture, offsets could be
used to trace which part of a kafka topic has been covered by the batch view,
and which part by the real-time / live view. For both cases I think that having
access to the meta information through a messageHandler, while maintaining
managed offsets, would be interesting
A proposal for the implementation is available at
https://github.com/juanrh/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala.
A new overload of KafkaUtils.createDirectStream for the Scala API, and another
for the Java API, have been added
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]