Hi,

Will a simpler client API be coming for indexing data from Kafka into Blur? As I mentioned below, I played with the Queue API; what I missed there was partitioner logic to channel the Kafka stream to the correct shard. Shard failover would also need to be handled if I implement the queue logic at the shard level.
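(For reference, the partitioner logic in question can be sketched roughly as below. This is an illustrative, standalone version only; `ShardRouter` and `shardFor` are hypothetical names, not Blur's actual BlurPartitioner API, though the idea is the same: hash the row id to pick a shard.)

```java
// Illustrative sketch of shard routing: hash the row id into [0, numShards).
// Not Blur's real API; class and method names here are made up.
public class ShardRouter {
    private final int numShards;

    public ShardRouter(int numShards) {
        this.numShards = numShards;
    }

    // Route a row id to a shard index. Mask with Integer.MAX_VALUE instead
    // of Math.abs, which can overflow for Integer.MIN_VALUE hash codes.
    public int shardFor(String rowId) {
        return (rowId.hashCode() & Integer.MAX_VALUE) % numShards;
    }
}
```

The routing must be deterministic so that the same row always lands on the same shard.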
Regards,
Dibyendu

On Fri, Feb 28, 2014 at 5:32 PM, Dibyendu Bhattacharya <[email protected]> wrote:
> Hi,
>
> I was just playing with the new QueueReader API, and as Tim pointed out,
> it is very low level. I still tried to implement a KafkaConsumer.
>
> Here is my use case. Let me know if I have approached it correctly.
>
> I have a topic in Kafka with 3 partitions, and in Blur I have a table
> with 2 shards. I need to index all messages from the Kafka topic into
> the Blur table.
>
> I used the Kafka ConsumerGroup API to consume the 3 partitions in 2
> parallel streams for indexing into the 2 Blur shards. Since the
> ConsumerGroup API lets me split any Kafka topic into N streams, I can
> choose N to match my target shard count, here 2.
>
> For the two shards I created two ShardContexts and two
> BlurIndexSimpleWriters. (Is this okay?)
>
> Then I modified BlurIndexSimpleWriter to expose a handle to its
> _queueReader object, and used that _queueReader to populate each shard's
> queue with messages taken from the KafkaStreams.
>
> Attached are the test case (KafkaReaderTest), KafkaStreamReader (which
> reads the Kafka stream), and KafkaQueueReader (the queue interface for
> Blur).
>
> Also attached is the modified BlurIndexSimpleWriter. I just added:
>
>     public QueueReader getQueueReader() {
>         return _queueReader;
>     }
>
> With these changes I am able to read Kafka messages in parallel streams
> and index them into the 2 shards. All documents from Kafka are indexed
> properly, and after the test cases run I can see two index directories
> under the two paths I created.
>
> Let me know if this approach is correct. In this code I have not handled
> shard-failure logic, and as Tim pointed out, it would be great if that
> could be abstracted away from the client.
>
> Regards,
> Dibyendu
>
> On Thu, Feb 27, 2014 at 9:40 PM, Aaron McCurry <[email protected]> wrote:
>
>> What if we provide an implementation of the QueueReader concept that
>> does what you are discussing.
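(The per-shard queue setup described above can be sketched with plain JDK types. This is an illustrative pattern only, assuming hash-based routing on the row id; `PerShardQueues`, `enqueue`, and `queueForShard` are hypothetical names, not the Blur or Kafka API.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: one in-memory queue per shard; a stream reader
// routes each message to its shard's queue, the way a KafkaStreamReader
// would feed per-shard QueueReaders. Not the actual Blur/Kafka API.
public class PerShardQueues {
    private final List<BlockingQueue<String>> queues = new ArrayList<>();

    public PerShardQueues(int numShards, int capacity) {
        for (int i = 0; i < numShards; i++) {
            queues.add(new ArrayBlockingQueue<>(capacity));
        }
    }

    // Route by hashing the row id, mirroring shard partitioning.
    // Returns false if the target shard's queue is full.
    public boolean enqueue(String rowId, String message) {
        int shard = (rowId.hashCode() & Integer.MAX_VALUE) % queues.size();
        return queues.get(shard).offer(message);
    }

    public BlockingQueue<String> queueForShard(int shard) {
        return queues.get(shard);
    }
}
```

Each shard's indexing writer would then drain only its own queue, so the same row id never ends up split across shards.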
>> That way, in more extreme cases when the user is forced into
>> implementing the lower-level API (perhaps for performance), they can
>> still do it, but for the normal case the partitioning (and other
>> difficult issues) are handled by the controllers.
>>
>> I could see adding an enqueueMutate call to the controllers that pushes
>> the mutates to the correct buckets for the user. At the same time, we
>> could allow each of the controllers to pull from an external source and
>> push the mutates to the correct buckets for the shards. I could see a
>> couple of different ways of handling this.
>>
>> However, I do agree that right now there is too much burden on the user
>> for the 95% case. We should make this simpler.
>>
>> Aaron
>>
>> On Thu, Feb 27, 2014 at 10:07 AM, Tim Williams <[email protected]> wrote:
>>
>> > I've been playing around with the new QueueReader stuff and I'm
>> > starting to believe it's at the wrong level of abstraction - in the
>> > shard context - for a user.
>> >
>> > Between having to know about the BlurPartitioner and handling all the
>> > failure nuances, I'm thinking a much friendlier approach would be to
>> > have the client implement a single message pump that Blur takes from
>> > and handles.
>> >
>> > Maybe on startup the controllers compete for the lead QueueReader
>> > position, create it from the TableContext, and run with it? The user
>> > would still need to deal with controller failures, but that seems
>> > easier to reason about than shard failures.
>> >
>> > The way it's crafted right now, the user seems burdened with a lot of
>> > the hard problems that Blur otherwise solves. Obviously, it trades
>> > off a high burden for one of the controllers.
>> >
>> > Thoughts?
>> > --tim
