Hi,

For the last few days I have been working on a Spark - Apache Blur connector to index Kafka messages into Apache Blur using Spark Streaming. We have been building a distributed search platform for our NRT use cases and have been experimenting with Spark Streaming and Apache Blur for this. Here is a Spark connector I would like to share with the community for feedback.
This connector uses the low-level Kafka consumer I wrote a few weeks back (https://github.com/dibbhatt/kafka-spark-consumer); there was a separate thread on that consumer in the Spark group. Even though I was able to index Kafka messages with that low-level consumer via the Apache Blur queuing API, I wanted to try the Spark saveAsHadoop* APIs, which can bulk load an RDD into Apache Blur. For that I have written this Blur connector for Spark (https://github.com/dibbhatt/spark-blur-connector).

The connector consumes from Kafka with the same low-level consumer mentioned above and partitions the RDD so that the number of partitions equals the number of shards of the target Blur table. For this I had to use custom Partitioner logic, so that the partitioning of keys in the RDD matches the partitioning of keys across the target Blur shards. I also implemented a custom BlurOutputFormat that returns the BlurOutputCommitter and uses the new Hadoop API (org.apache.hadoop.mapreduce). I made a few minor changes to the existing GenericBlurRecordWriter and BlurOutputCommitter and used the modified RecordWriter and OutputCommitter in this connector; if those minor issues are fixed in Apache Blur, this custom code will no longer be needed. Two rough sketches of the idea follow below.
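First, the partitioner. This is just an illustrative sketch in Scala (the class name BlurShardPartitioner is only for this mail, and it assumes Blur's default hash-on-rowId shard placement); the real code is in the repository above:

    import org.apache.spark.Partitioner

    // Illustrative only: route each key to the shard index Blur would pick,
    // assuming Blur places a row by hashing its rowId. With numPartitions
    // equal to the shard count of the target table, RDD partition i then
    // lines up one-to-one with Blur shard i.
    class BlurShardPartitioner(numShards: Int) extends Partitioner {
      override def numPartitions: Int = numShards

      override def getPartition(key: Any): Int =
        (key.hashCode & Int.MaxValue) % numShards
    }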
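Second, the save path. Again a rough sketch, not the actual connector code: it uses the stock BlurOutputFormat from the Blur MR lib, whereas the connector plugs in the modified output format/committer, and the object name BlurBulkLoad is only for illustration. It also assumes the Configuration has already been set up against the target table (e.g. via BlurOutputFormat.setupJob):

    import org.apache.blur.mapreduce.lib.{BlurMutate, BlurOutputFormat}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.Text
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    object BlurBulkLoad {
      // Bulk load a keyed RDD into a Blur table using the new
      // org.apache.hadoop.mapreduce output path. Keys are Blur rowIds (Text),
      // values are BlurMutate, matching Blur's MR record types.
      def indexIntoBlur(rdd: RDD[(Text, BlurMutate)],
                        outputPath: String,
                        numShards: Int,
                        conf: Configuration): Unit = {
        rdd
          // Repartition so RDD partitions line up with the table's shards.
          .partitionBy(new BlurShardPartitioner(numShards))
          .saveAsNewAPIHadoopFile(
            outputPath,
            classOf[Text],
            classOf[BlurMutate],
            classOf[BlurOutputFormat],
            conf)
      }
    }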
I have tested this connector by indexing activity streams coming into our Kafka cluster, and it indexes the Kafka messages into the target Apache Blur tables nicely. Would love to hear what you think. I have copied both the Apache Blur and Spark communities.

Regards,
Dibyendu