Florian,

>> I assume this is just old documentation however looking through the code
(on master) I couldn't find any unit tests for load balancing so I am
actually not sure it is supported... Has anyone used it, should it exist?

You are right. The earlier part of the documentation is outdated. We will
fix this. Having said that,
since 0.6 release, Kafka does support in-built load balancing of produce
requests across brokers. This requires no knowledge of
cluster topology and the default partitioning strategy is a simple
hash(key) % num_partitions.

This in-built load balancing can also be used for semantic partitioning.
For example, if you want all updates for a particular record_id
to be consumed in order, you will use the record_id as the key in the
producer. This ensures that all events with that record_id are
published to the same broker partition and can be consumed in order by the
consumer.

However, if you choose not to use a key, the partitioning strategy picks a
random broker partition to publish each produce request.

Thanks,
Neha

On Wed, Nov 30, 2011 at 9:10 AM, Taylor Gautier <tgaut...@tagged.com> wrote:

> 0.6 and onward should support load balancing by the use of partitions.  It
> is required that you use a high-level consumer that can cooperate properly
> with Kafka and Zookeeper since the partitions are created and managed
> dynamically and can be re-balanced.
>
> It is important to note that order cannot be guaranteed for messages that
> span partition boundaries, and for this reason it is possible to use a
> semantic partition key that can route messages based on their data to a
> particular partition, so that while the whole stream might be out of order
> wrt to messages across partitions, for a particular kind of message within
> the stream (let's say by hostname or userid or ticker symbol) that is fixed
> to a given partition the data will be in order.
>
> This architecture implies that every topic is created everywhere, so
> supposing you have 3 machines and 10 partitions, you might end up with 4
> partitions of topic A on machine 1, 3 on machine 2, and 3 on machine 3.
>
> Depending on your use case, this might be sufficient.
>
> For my purposes, it was not.  My first use case uses Kafka in a slightly
> twisted way (not exactly how it was intended to be used) in that I have
> many topics (hundreds of thousands) that receive low traffic on a per topic
> basis but high throughput in aggregate.
>
> Since Kafka stores all topics in a single subdirectory, it is infeasible to
> have 100k+ topics in a single directory.  I managed this by creating a
> separate tier which receives UDP packets as messages, and forwards them on
> to Kafka using a hashring to decide which Kafka instance to send the
> message to.  Each Kafka instance is silo'd - that is it doesn't know about
> the other Kafka instances.  If it did, it would try to balance the
> partitions across the machines which I don't want.
>
> This gives me good scalability as the number of topics increase.  My
> overall throughput however is somewhat diminished because I am forcing
> Kafka to fsync lots and lots of files (and I want low latency, another
> thing Kafka wasn't exactly designed for) -- but overall I am still managing
> to get great numbers that are more than sufficient for my current needs.
>
> You might ask - why UDP?  Well, the standard Kafka protocol uses TCP which
> is fine for a long running process that is say monitoring syslog logs and
> pumping them into Kafka.  It's not at all appropriate for a short-lived
> process - like let's say PHP.  Hmmm.  :)  So for my architecture, it was
> necessary to implement a UDP listener that would accept incoming messages ,
> collect them, and send them to the Kafka tier.
>
> The advantage to this architecture is that only the UDP listener tier has
> to know the exact Kafka cluster topology.  Everyone else just sends off a
> fire and forget UDP packet.
>
> Hope this was helpful (and not too confusing).
>
> I intend to write up my use case and some of the solutions we invented in
> bringing it to life in the near future.
>
> On Tue, Nov 29, 2011 at 6:01 PM, Florian Leibert <f...@leibert.de> wrote:
>
> > Hi -
> > reading through the design doc of Kafka I read that it doesn't support
> LB:
> >
> > Currently, there is no built-in load balancing between the producers and
> > the brokers in Kafka; in our own usage we publish from a large number of
> > heterogeneous machines and so it is desirable that the publisher not need
> > any explicit knowledge of the cluster topology. We rely on a hardware
> load
> > balancer to distribute the producer load across multiple brokers. We will
> > consider adding this in a future release to allow semantic partitioning
> of
> > messages (i.e. publishing all messages to a particular broker based on
> some
> > id to ensure an ordered stream of updates within that id).
> >
> > However it says further down:
> > In v0.6, we introduced built-in automatic load balancing between the
> > producers and the brokers in Kafka; Currently, in our own usage we
> publish
> > from a large number of heterogeneous machines and so it is
> >
> >
> > I assume this is just old documentation however looking through the code
> > (on master) I couldn't find any unit tests for load balancing so I am
> > actually not sure it is supported... Has anyone used it, should it exist?
> >
> > Thanks
> >
> > Florian
> >
>

Reply via email to