Hi,

For the last few days I have been working on a Spark - Apache Blur connector to index Kafka messages into Apache Blur using Spark Streaming. We have been building a distributed search platform for our NRT use cases and have been experimenting with Spark Streaming and Apache Blur for this. Here is a Spark connector I would like to share with the community for feedback.
This connector uses the low-level Kafka consumer I wrote a few weeks back (https://github.com/dibbhatt/kafka-spark-consumer); there was a separate thread on that consumer in the Spark group. Even though I was able to index Kafka messages with that low-level consumer via the Apache Blur queuing API, I wanted to try the Spark saveAsHadoop* APIs, which can bulk load an RDD into Apache Blur. For that I have written this Blur connector for Spark (https://github.com/dibbhatt/spark-blur-connector).

The connector consumes from Kafka with the same low-level consumer mentioned above and partitions the RDD so that the number of partitions equals the number of shards of the target Blur table. For this I had to use custom Partitioner logic, so that the partitioning of keys in the RDD matches the partitioning of keys across the target Blur shards. I also implemented a custom BlurOutputFormat that returns the BlurOutputCommitter and uses the new Hadoop API (org.apache.hadoop.mapreduce). I made a few minor changes to the existing GenericBlurRecordWriter and BlurOutputCommitter and used the modified RecordWriter and OutputCommitter in this connector; if those minor issues are fixed in Apache Blur, this custom code will no longer be needed. Two rough sketches of the idea follow below.
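First, the partitioner. This is just an illustrative sketch in Scala (the class name BlurShardPartitioner is only for this mail, and it assumes Blur's default hash-on-rowId shard placement); the real code is in the repository above:

    import org.apache.spark.Partitioner

    // Illustrative only: route each key to the shard index Blur would pick,
    // assuming Blur places a row by hashing its rowId. With numPartitions
    // equal to the shard count of the target table, RDD partition i then
    // lines up one-to-one with Blur shard i.
    class BlurShardPartitioner(numShards: Int) extends Partitioner {
      override def numPartitions: Int = numShards

      override def getPartition(key: Any): Int =
        (key.hashCode & Int.MaxValue) % numShards
    }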
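Second, the save path. Again a rough sketch, not the actual connector code: it uses the stock BlurOutputFormat from the Blur MR lib, whereas the connector plugs in the modified output format/committer, and the object name BlurBulkLoad is only for illustration. It also assumes the Configuration has already been set up against the target table (e.g. via BlurOutputFormat.setupJob):

    import org.apache.blur.mapreduce.lib.{BlurMutate, BlurOutputFormat}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.Text
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    object BlurBulkLoad {
      // Bulk load a keyed RDD into a Blur table using the new
      // org.apache.hadoop.mapreduce output path. Keys are Blur rowIds (Text),
      // values are BlurMutate, matching Blur's MR record types.
      def indexIntoBlur(rdd: RDD[(Text, BlurMutate)],
                        outputPath: String,
                        numShards: Int,
                        conf: Configuration): Unit = {
        rdd
          // Repartition so RDD partitions line up with the table's shards.
          .partitionBy(new BlurShardPartitioner(numShards))
          .saveAsNewAPIHadoopFile(
            outputPath,
            classOf[Text],
            classOf[BlurMutate],
            classOf[BlurOutputFormat],
            conf)
      }
    }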
I have tested this connector by indexing activity streams coming into our Kafka cluster, and it indexes the Kafka messages into the target Apache Blur tables nicely. Would love to hear what you think. I have copied both the Apache Blur and Spark communities.

Regards,
Dibyendu