Re: Question about Producer->Broker load balancing

Jun Rao Wed, 30 Nov 2011 14:03:47 -0800

Florian,

Our highlevel consumer api will divide partitions roughly evenly among
consumers in a group. Each consumer will then get a non-overlapping portion
of messages in a topic. At this moment, you can't tell which partition a
message comes from. Other than that, you can choose to publish the
aggregated results to another topic.


Jun

On Wed, Nov 30, 2011 at 12:18 PM, Florian Leibert <f...@leibert.de> wrote:

> Thanks for the responses.
>
>
> @Taylor (or someone else who can answer this):
>
> So you say I need a high-level consumer - I assume that can I have
> multiple high level consumers that do aggregations on a partition
> basis and then produce the aggregates under a different topic?
>
> Essentially I want to have a hierarchical system of messages and make
> sure that also the consumers balance the load b/c a single consumer
> couldn't take that much traffic...
>
>
> Thanks!
>
> Florian
>
>
>
>
> 0.6 and onward should support load balancing by the use of partitions.  It
> is required that you use a high-level consumer that can cooperate properly
> with Kafka and Zookeeper since the partitions are created and managed
> dynamically and can be re-balanced.
>
> It is important to note that order cannot be guaranteed for messages that
> span partition boundaries, and for this reason it is possible to use a
> semantic partition key that can route messages based on their data to a
> particular partition, so that while the whole stream might be out of order
> wrt to messages across partitions, for a particular kind of message within
> the stream (let's say by hostname or userid or ticker symbol) that is fixed
> to a given partition the data will be in order.
>
> This architecture implies that every topic is created everywhere, so
> supposing you have 3 machines and 10 partitions, you might end up with 4
> partitions of topic A on machine 1, 3 on machine 2, and 3 on machine 3.
>
> Depending on your use case, this might be sufficient.
>
> For my purposes, it was not.  My first use case uses Kafka in a slightly
> twisted way (not exactly how it was intended to be used) in that I have
> many topics (hundreds of thousands) that receive low traffic on a per topic
> basis but high throughput in aggregate.
>
> Since Kafka stores all topics in a single subdirectory, it is infeasible to
> have 100k+ topics in a single directory.  I managed this by creating a
> separate tier which receives UDP packets as messages, and forwards them on
> to Kafka using a hashring to decide which Kafka instance to send the
> message to.  Each Kafka instance is silo'd - that is it doesn't know about
> the other Kafka instances.  If it did, it would try to balance the
> partitions across the machines which I don't want.
>
> This gives me good scalability as the number of topics increase.  My
> overall throughput however is somewhat diminished because I am forcing
> Kafka to fsync lots and lots of files (and I want low latency, another
> thing Kafka wasn't exactly designed for) -- but overall I am still managing
> to get great numbers that are more than sufficient for my current needs.
>
> You might ask - why UDP?  Well, the standard Kafka protocol uses TCP which
> is fine for a long running process that is say monitoring syslog logs and
> pumping them into Kafka.  It's not at all appropriate for a short-lived
> process - like let's say PHP.  Hmmm.  :)  So for my architecture, it was
> necessary to implement a UDP listener that would accept incoming messages ,
> collect them, and send them to the Kafka tier.
>
> The advantage to this architecture is that only the UDP listener tier has
> to know the exact Kafka cluster topology.  Everyone else just sends off a
> fire and forget UDP packet.
>
> Hope this was helpful (and not too confusing).
>
> I intend to write up my use case and some of the solutions we invented in
> bringing it to life in the near future.
>
> On Tue, Nov 29, 2011 at 6:01 PM, Florian Leibert <f...@leibert.de> wrote:
>
> > Hi -
> > reading through the design doc of Kafka I read that it doesn't support
> LB:
> >
> > Currently, there is no built-in load balancing between the producers and
> > the brokers in Kafka; in our own usage we publish from a large number of
> > heterogeneous machines and so it is desirable that the publisher not need
> > any explicit knowledge of the cluster topology. We rely on a hardware
> load
> > balancer to distribute the producer load across multiple brokers. We will
> > consider adding this in a future release to allow semantic partitioning
> of
> > messages (i.e. publishing all messages to a particular broker based on
> some
> > id to ensure an ordered stream of updates within that id).
> >
> > However it says further down:
> > In v0.6, we introduced built-in automatic load balancing between the
> > producers and the brokers in Kafka; Currently, in our own usage we
> publish
> > from a large number of heterogeneous machines and so it is
> >
> >
> > I assume this is just old documentation however looking through the code
> > (on master) I couldn't find any unit tests for load balancing so I am
> > actually not sure it is supported... Has anyone used it, should it exist?
> >
> > Thanks
> >
> > Florian
> >
>

Re: Question about Producer->Broker load balancing

Reply via email to