Hi Dibyendu,

If you still want to contribute the Spark code to Blur, that would be awesome! I think that we would need an ICLA from you:
http://www.apache.org/licenses/icla.txt

It might also be a good idea to get a CCLA from your company:
http://www.apache.org/licenses/cla-corporate.txt

Maybe someone else can help out here: will we need a Software Grant as well?

Thanks!

Aaron

On Tue, Sep 23, 2014 at 8:00 AM, Dibyendu Bhattacharya <[email protected]> wrote:

> Thanks Aaron. I would love to do that.
>
> Dibyendu
>
> On Sep 23, 2014 5:17 PM, "Aaron McCurry" <[email protected]> wrote:
>
> > On Mon, Sep 22, 2014 at 4:21 AM, Dibyendu Bhattacharya <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > For the last few days I have been working on a Spark - Apache Blur connector
> > > to index Kafka messages into Apache Blur using Spark Streaming. We have
> > > been working on building a distributed search platform for our NRT use
> > > cases, and we have been evaluating Spark Streaming and Apache Blur for it.
> > > We are presently working with Apache Blur, and here is a Spark connector I
> > > would like to share with the community to get feedback on it.
> > >
> > > This connector uses the low-level Kafka consumer which I wrote a few weeks
> > > back (https://github.com/dibbhatt/kafka-spark-consumer). There was a
> > > separate thread on this Kafka consumer in the Spark group.
> > >
> > > Even though I was able to index Kafka messages using this low-level
> > > consumer via the Apache Blur queuing API, I wanted to try out the Spark
> > > saveAsHadoop* APIs, which can perform bulk loading of an RDD into Apache
> > > Blur.
> > >
> > > For that I have written this Blur connector for Spark:
> > > https://github.com/dibbhatt/spark-blur-connector
> > >
> > > This connector uses the same low-level Kafka consumer mentioned above, and
> > > partitions the RDD so that the number of partitions equals the number of
> > > shards in the target Blur table. For this I had to use custom partitioner
> > > logic so that the partitioning of keys in the RDD matches the partitioning
> > > of keys across the target Blur shards.
> > >
> > > I also implemented a custom BlurOutputFormat to return the
> > > BlurOutputCommitter, which uses the new Hadoop API
> > > (org.apache.hadoop.mapreduce).
> > >
> > > There are a few minor changes I made to the existing GenericBlurRecordWriter
> > > and BlurOutputCommitter, and I used the modified RecordWriter and
> > > OutputCommitter for this Spark Blur connector. If those minor issues are
> > > fixed in Apache Blur, there will be no need for this custom code.
> > >
> > > I have tested this connector indexing activity streams coming into a Kafka
> > > cluster, and it nicely indexes Kafka messages into the target Apache Blur
> > > tables.
> > >
> > > Would love to hear what you think. I have copied both the Apache Blur and
> > > Spark communities.
> > >
> >
> > I think this is awesome! I have read through the code, but I still need to
> > get it running to put it through its paces. :-) Do you think that you would
> > want to contribute this code into Blur?
> >
> > Aaron
> >
>
> > > Regards,
> > > Dibyendu
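The shard-aligned partitioning described in the thread can be sketched in plain Java. The core idea is that the Spark-side partition function and Blur's own row routing must hash a row id to the same shard index, so each bulk-loaded RDD partition lines up with exactly one Blur shard. This is a minimal illustration, assuming a hash-mod-shard-count routing scheme; `ShardPartitioner` and `getShard` are hypothetical names, not classes from the connector or from Blur itself.

```java
// Hypothetical sketch of shard-aligned key partitioning, assuming a
// hash-mod-shard-count scheme. Not the connector's actual code.
public class ShardPartitioner {
    private final int numShards;

    public ShardPartitioner(int numShards) {
        this.numShards = numShards;
    }

    // Map a row id to a shard index in [0, numShards).
    // Masking with Integer.MAX_VALUE keeps the hash non-negative.
    public int getShard(String rowId) {
        return (rowId.hashCode() & Integer.MAX_VALUE) % numShards;
    }

    public static void main(String[] args) {
        // With the RDD partitioned by this same function, the rows that
        // land in Spark partition N are exactly the rows Blur would route
        // to shard N, which is what makes the bulk load line up.
        ShardPartitioner p = new ShardPartitioner(4);
        for (String id : new String[] {"row-1", "row-2", "row-3"}) {
            System.out.println(id + " -> shard " + p.getShard(id));
        }
    }
}
```

In Spark this function would back a custom `Partitioner` whose `numPartitions` equals the target table's shard count, so that `saveAsNewAPIHadoopFile` writes one shard's worth of rows per partition.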
