For an alternative take on a similar idea, see https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic. I haven't had a need
to write to kafka from spark yet, so that's an obvious advantage of your
library. I think the existing kafka dstream is inadequate for a number of use
cases, and I would really like to see some combination of these approaches
make it into the spark codebase. Rough sketches of both directions follow
below the quoted message.

On Sun, Dec 14, 2014 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hello all,
> we at tresata wrote a library to provide for batch integration between
> spark and kafka (distributed write of rdd to kafka, distributed read of rdd
> from kafka). our main use cases are (in lambda architecture jargon):
> * periodic appends to the immutable master dataset on hdfs from kafka using
> spark
> * make non-streaming data available in kafka with periodic data drops from
> hdfs using spark. this is to facilitate merging the speed and batch layers
> in spark-streaming
> * distributed writes from spark-streaming
>
> see here:
> https://github.com/tresata/spark-kafka
>
> best,
> koert
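
To make the deterministic-offsets point concrete, here is a rough sketch of
reading a fixed offset range as a batch RDD. It borrows the
KafkaUtils.createRDD / OffsetRange shape from the spark-streaming-kafka
module as a stand-in; the branch linked above may differ in details, and the
topic name, partitions, offsets, broker address and output path are all
placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("kafka-batch-read").setMaster("local[2]"))

    // broker list for the simple-consumer based reader (placeholder address)
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    // each range pins (topic, partition, fromOffset, untilOffset) up front,
    // so the RDD's contents are fully known before any task runs --
    // that is the "deterministic" property mentioned above
    val offsetRanges = Array(
      OffsetRange("events", 0, 0L, 1000L),
      OffsetRange("events", 1, 0L, 1000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // e.g. append this batch to the immutable master dataset on hdfs
    rdd.map { case (_, value) => value }
      .saveAsTextFile("hdfs:///data/events/batch-00000")

    sc.stop()
  }
}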
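
And for the other direction, a minimal sketch of a distributed write of an
RDD to kafka using the generic producer-per-partition pattern. This is not
the spark-kafka library's actual API; the object and function names, topic,
and broker settings are illustrative only:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

object KafkaBatchWriteSketch {
  // create one producer per partition so the write runs in parallel
  // on the executors instead of funneling through the driver
  def writeToKafka(rdd: RDD[String], topic: String, brokers: String): Unit =
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", brokers)
      props.put("key.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer",
        "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try records.foreach(r =>
        producer.send(new ProducerRecord[String, String](topic, r)))
      finally producer.close()
    }
}

Usage would be something like writeToKafka(rdd, "events", "localhost:9092")
from a batch job, or inside a foreachRDD when writing from spark-streaming.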