I think this is a good point. It might be good to just take the .NET patch now if it works, but I don't think that is a great long-term strategy. The consumer is too complicated right now to maintain simultaneously in every language.
Let me give a couple of options I don't think are great:

1. Using Thrift or protocol buffers doesn't really solve the problem. The problem we have isn't our serialized wire format, it is the complex code needed in the consumer to give a simplified API to the user. Even if we used protocol buffers it wouldn't really help thin the client; it would just replace our request/response objects with new generated ones and add a new runtime dependency for users. We did this for Voldemort by adding optional protocol buffer definitions for all the requests, and in the end no one used them because a simple wire format was just as easy to use and protocol buffers supported so few languages. Changing our request/response objects might be a good idea for other reasons; it just doesn't help with this.

2. Trying to adopt an off-the-shelf protocol. We really do want to give good performance and reasonable distributed semantics. I looked at a few of these generic protocols and I think they essentially imply a particular implementation. Actually I think they imply the union of the implementations of all the systems involved in the standardization process :-) Worse, I don't think they really solve the harder problems of balancing consumer load.

One option that would be okay would be creating a simple RESTful proxy that consumed into a buffer, handed out the messages in the buffer on request, and committed the offset at the same time. This is inefficient and would not allow semantic partitioning, but it would make the simple case simple and is easy to implement as a standalone module.

For a slightly more long-term solution, here is my thought. The fat scala/jvm client does a couple of hard things: (1) co-ordination, (2) multithreading, including a non-blocking fetch with a queue in the middle to buffer consumption, and (3) handling many topics and/or multithreaded consumption from the same socket and fetcher pool.
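To make the proxy idea concrete, here is a minimal sketch of the consume-into-a-buffer, hand-out-and-commit behavior. All names here are hypothetical (there is no such module today), and the HTTP layer and real broker fetch are elided; the point is just that handing out messages and committing the offset happen in the same step:

```python
from collections import deque

class ConsumerProxy:
    """Illustrative stand-in for the proposed RESTful proxy; names are made up."""

    def __init__(self):
        self.buffer = deque()        # messages fetched but not yet handed out
        self.committed_offset = 0    # offset acknowledged back to the broker

    def fill(self, messages):
        # In the real proxy this would be a background fetch from the broker.
        self.buffer.extend(messages)

    def poll(self, max_messages):
        # A request like "GET /topics/foo/messages?max=N" would map to this:
        # hand out up to max_messages and commit the offset in the same step.
        out = []
        while self.buffer and len(out) < max_messages:
            out.append(self.buffer.popleft())
        self.committed_offset += len(out)   # commit happens with the hand-out
        return out

proxy = ConsumerProxy()
proxy.fill(["m1", "m2", "m3"])
print(proxy.poll(2))           # ['m1', 'm2']
print(proxy.committed_offset)  # 2
```

Committing on hand-out is what makes this at-most-once and inefficient, but it is also what keeps a client binding down to plain HTTP calls.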
Here is how we could remove the co-ordination zookeeper code from the client. What if we just made a PartitionAssignmentRequest and moved that whole chunk of logic about who gets what to the server? What if, instead of the clients co-ordinating to choose consumers, the clients just registered themselves in zookeeper, watched the other consumers and brokers, and responded to any state change by re-requesting partition assignments from the servers? Actually you might be able to simplify this further by just having the consumers register, and having the server disconnect them whenever they need new partition assignments (at which point they would respond by getting new assignments and connecting to those). The brokers would pick a master, which would put its node id in zk, and consumers would use this to make the request for partitions. I think centralizing the logic might make fancier locality-aware partition assignments easier to implement and debug too. Now the only requirement on the client is that it be able to register in zookeeper, which should be pretty easy in most languages (C, python, etc).

For the next two "hard" pieces, I think the answer is easy: just don't do them. I think our current consumer is very good, and a good fit for the thread-centric model of the jvm, but for non-jvm languages just single-threading the consumer is preferable. I think a better api for these cases would be a non-blocking select on all the open broker connections, done inline as part of the iterator (e.g. when you are computing next(), if you are out of data, just check your socket buffers). Since the select is non-blocking, doing this inline should not be an issue at all. I think the best approach would be to implement this in C and then just wrap it for other languages, but doing it in python or ruby or whatever would likely be a pretty small amount of code too.
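Here is a sketch of what that inline non-blocking select could look like, in python for brevity (the real wire protocol decoding is faked with newline-delimited strings, and the class name is made up): next() only touches the sockets when its local list of decoded messages runs dry, and a zero-timeout select means it never blocks.

```python
import selectors
import socket

class InlineConsumer:
    """Illustrative single-threaded consumer; not the real client API."""

    def __init__(self, connections):
        self.sel = selectors.DefaultSelector()
        for conn in connections:
            conn.setblocking(False)
            self.sel.register(conn, selectors.EVENT_READ)
        self.pending = []   # decoded messages not yet handed to the caller

    def __iter__(self):
        return self

    def __next__(self):
        if not self.pending:
            # timeout=0 makes this a non-blocking poll of all broker sockets
            for key, _ in self.sel.select(timeout=0):
                data = key.fileobj.recv(4096)
                if data:
                    # real code would decode a fetch response here; we just
                    # pretend each newline-delimited chunk is a message
                    self.pending.extend(data.decode().split("\n"))
        if not self.pending:
            raise StopIteration   # out of data for now
        return self.pending.pop(0)

# Simulate two "broker" connections with socketpairs.
a_client, a_broker = socket.socketpair()
b_client, b_broker = socket.socketpair()
a_broker.sendall(b"m1\nm2")
b_broker.sendall(b"m3")

consumer = InlineConsumer([a_client, b_client])
messages = sorted(consumer)   # iterates until both sockets are drained
print(messages)               # ['m1', 'm2', 'm3']
```

A real client would loop or back off instead of raising StopIteration when the buffers are empty, but the shape is the same: one thread, no queue in the middle, and the select cost hidden inside iteration.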
We would not try to share a single connection per node, we would just have one per topic-iterator, which is not ideal but probably fine and greatly simplifies everything.

Cheers,

-Jay

On Wed, Sep 7, 2011 at 9:21 AM, Jun Rao <[email protected]> wrote:

> KAFKA-85 raised a good question: what's the right approach to support
> client bindings for languages other than java? I don't have a perfect
> answer and would like to start a discussion and let everybody weigh in.
>
> The approach that KAFKA-85 took is to re-write all the logic in our fat
> client (both the producer and the consumer) in C#. This means that a lot
> of code has to be re-written and maintained and it's a lot of work if
> every language does the same thing.
>
> There are 2 other approaches that some Apache projects have used to
> support different language binding. The first one is to use an RPC code
> generator to directly expose the api to other languages. For example,
> Cassandra uses Thrift to define the client API and let Thrift generate
> language specific client code to talk to server. This approach works
> well for thin clients. In Cassandra, the client only does
> serializing/deserializing requests/responses and the complex routing
> logic is on the server. This approach may not work well with Kafka since
> our client is relatively fat (lots of code in handling both the produce
> and the consume request in the client library).
>
> The second approach is to have a gateway. For example, HBase also has a
> relatively fat java client. To support other languages, it exposes its
> api indirectly in a java gateway. The gateway api is compiled into
> different languages using Thrift. The generated gateway client code is
> thin and all the complicated routing logic is in the gateway itself. The
> downside is that this adds the complexity of maintaining the gateway and
> adds one extra hop between the client and the server.
> Setting these 2 concerns aside, this approach probably works well with
> Kafka producers. However, it's not clear how this works with the
> consumers since they get data continuously.
>
> Does anyone know how other queuing systems (activemq, rabbitmq, etc)
> support non-java clients?
>
> Thanks,
>
> Jun
