For an alternative take on a similar idea, see

https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka

An advantage of the approach I'm taking is that the lower and upper offsets
of the RDD are known in advance, so it's deterministic: recomputing a
partition reads back exactly the same messages.
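
To make that concrete, here is roughly the shape I mean. OffsetRange, the
topic name, and the offsets below are made up for this sketch; this is not
the actual API in that branch:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only -- OffsetRange is invented for this sketch, not taken
// from the branch linked above.
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

object OffsetRangeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("offset-range-sketch").setMaster("local[*]"))

    // The batch is fully described by these bounds, fixed before the job
    // runs, so recomputing a lost partition re-reads the same messages.
    // Topic name and offsets are placeholders.
    val ranges = Seq(
      OffsetRange("events", partition = 0, fromOffset = 1000L, untilOffset = 2000L),
      OffsetRange("events", partition = 1, fromOffset = 950L, untilOffset = 1900L))

    // One Spark partition per offset range; the actual fetch from the Kafka
    // brokers is elided here and just yields no messages.
    val rdd = sc.parallelize(ranges, ranges.size).flatMap { r =>
      Iterator[String]() // would fetch offsets r.fromOffset until r.untilOffset
    }

    println(s"partitions: ${rdd.partitions.length}")
    sc.stop()
  }
}

The point is that the bounds are an input to the job rather than something
observed while it runs.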

I haven't had a need to write to Kafka from Spark yet, so that's an obvious
advantage of your library.
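
For readers who haven't seen it, a bare-bones distributed write without any
library support looks something like the following. This is not the tresata
API; the broker address, topic name, and data are placeholders:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.{SparkConf, SparkContext}

object RddToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-to-kafka-sketch").setMaster("local[*]"))
    val lines = sc.parallelize(Seq("first", "second", "third"))

    // Broker address and topic name are placeholders.
    lines.foreachPartition { part =>
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      // The producer is created on the executor, inside the task, because it
      // is not serializable and cannot be shipped from the driver.
      val producer = new KafkaProducer[String, String](props)
      try part.foreach(line => producer.send(new ProducerRecord[String, String]("events", line)))
      finally producer.close() // flushes buffered records before the task ends
    }

    sc.stop()
  }
}

The important detail is constructing the producer inside foreachPartition,
one per task, since producers can't be serialized and sent from the driver.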

I think the existing Kafka DStream is inadequate for a number of use cases,
and would really like to see some combination of these approaches make it
into the Spark codebase.


On Sun, Dec 14, 2014 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> hello all,
> we at tresata wrote a library to provide for batch integration between
> spark and kafka (distributed write of rdd to kafka, distributed read of rdd
> from kafka). our main use cases are (in lambda architecture jargon):
> * periodic appends to the immutable master dataset on hdfs from kafka using
> spark
> * make non-streaming data available in kafka with periodic data drops from
> hdfs using spark. this is to facilitate merging the speed and batch layers
> in spark-streaming
> * distributed writes from spark-streaming
>
> see here:
> https://github.com/tresata/spark-kafka
>
> best,
> koert
>
