Florian, Our highlevel consumer api will divide partitions roughly evenly among consumers in a group. Each consumer will then get a non-overlapping portion of messages in a topic. At this moment, you can't tell which partition a message comes from. Other than that, you can choose to publish the aggregated results to another topic.
Jun On Wed, Nov 30, 2011 at 12:18 PM, Florian Leibert <f...@leibert.de> wrote: > Thanks for the responses. > > > @Taylor (or someone else who can answer this): > > So you say I need a high-level consumer - I assume that can I have > multiple high level consumers that do aggregations on a partition > basis and then produce the aggregates under a different topic? > > Essentially I want to have a hierarchical system of messages and make > sure that also the consumers balance the load b/c a single consumer > couldn't take that much traffic... > > > Thanks! > > Florian > > > > > 0.6 and onward should support load balancing by the use of partitions. It > is required that you use a high-level consumer that can cooperate properly > with Kafka and Zookeeper since the partitions are created and managed > dynamically and can be re-balanced. > > It is important to note that order cannot be guaranteed for messages that > span partition boundaries, and for this reason it is possible to use a > semantic partition key that can route messages based on their data to a > particular partition, so that while the whole stream might be out of order > wrt to messages across partitions, for a particular kind of message within > the stream (let's say by hostname or userid or ticker symbol) that is fixed > to a given partition the data will be in order. > > This architecture implies that every topic is created everywhere, so > supposing you have 3 machines and 10 partitions, you might end up with 4 > partitions of topic A on machine 1, 3 on machine 2, and 3 on machine 3. > > Depending on your use case, this might be sufficient. > > For my purposes, it was not. My first use case uses Kafka in a slightly > twisted way (not exactly how it was intended to be used) in that I have > many topics (hundreds of thousands) that receive low traffic on a per topic > basis but high throughput in aggregate. > > Since Kafka stores all topics in a single subdirectory, it is infeasible to > have 100k+ topics in a single directory. I managed this by creating a > separate tier which receives UDP packets as messages, and forwards them on > to Kafka using a hashring to decide which Kafka instance to send the > message to. Each Kafka instance is silo'd - that is it doesn't know about > the other Kafka instances. If it did, it would try to balance the > partitions across the machines which I don't want. > > This gives me good scalability as the number of topics increase. My > overall throughput however is somewhat diminished because I am forcing > Kafka to fsync lots and lots of files (and I want low latency, another > thing Kafka wasn't exactly designed for) -- but overall I am still managing > to get great numbers that are more than sufficient for my current needs. > > You might ask - why UDP? Well, the standard Kafka protocol uses TCP which > is fine for a long running process that is say monitoring syslog logs and > pumping them into Kafka. It's not at all appropriate for a short-lived > process - like let's say PHP. Hmmm. :) So for my architecture, it was > necessary to implement a UDP listener that would accept incoming messages , > collect them, and send them to the Kafka tier. > > The advantage to this architecture is that only the UDP listener tier has > to know the exact Kafka cluster topology. Everyone else just sends off a > fire and forget UDP packet. > > Hope this was helpful (and not too confusing). > > I intend to write up my use case and some of the solutions we invented in > bringing it to life in the near future. > > On Tue, Nov 29, 2011 at 6:01 PM, Florian Leibert <f...@leibert.de> wrote: > > > Hi - > > reading through the design doc of Kafka I read that it doesn't support > LB: > > > > Currently, there is no built-in load balancing between the producers and > > the brokers in Kafka; in our own usage we publish from a large number of > > heterogeneous machines and so it is desirable that the publisher not need > > any explicit knowledge of the cluster topology. We rely on a hardware > load > > balancer to distribute the producer load across multiple brokers. We will > > consider adding this in a future release to allow semantic partitioning > of > > messages (i.e. publishing all messages to a particular broker based on > some > > id to ensure an ordered stream of updates within that id). > > > > However it says further down: > > In v0.6, we introduced built-in automatic load balancing between the > > producers and the brokers in Kafka; Currently, in our own usage we > publish > > from a large number of heterogeneous machines and so it is > > > > > > I assume this is just old documentation however looking through the code > > (on master) I couldn't find any unit tests for load balancing so I am > > actually not sure it is supported... Has anyone used it, should it exist? > > > > Thanks > > > > Florian > > >