0.6 and onward should support load balancing by the use of partitions.  It
is required that you use a high-level consumer that can cooperate properly
with Kafka and Zookeeper since the partitions are created and managed
dynamically and can be re-balanced.

It is important to note that order cannot be guaranteed for messages that
span partition boundaries, and for this reason it is possible to use a
semantic partition key that can route messages based on their data to a
particular partition, so that while the whole stream might be out of order
wrt to messages across partitions, for a particular kind of message within
the stream (let's say by hostname or userid or ticker symbol) that is fixed
to a given partition the data will be in order.

This architecture implies that every topic is created everywhere, so
supposing you have 3 machines and 10 partitions, you might end up with 4
partitions of topic A on machine 1, 3 on machine 2, and 3 on machine 3.

Depending on your use case, this might be sufficient.

For my purposes, it was not.  My first use case uses Kafka in a slightly
twisted way (not exactly how it was intended to be used) in that I have
many topics (hundreds of thousands) that receive low traffic on a per topic
basis but high throughput in aggregate.

Since Kafka stores all topics in a single subdirectory, it is infeasible to
have 100k+ topics in a single directory.  I managed this by creating a
separate tier which receives UDP packets as messages, and forwards them on
to Kafka using a hashring to decide which Kafka instance to send the
message to.  Each Kafka instance is silo'd - that is it doesn't know about
the other Kafka instances.  If it did, it would try to balance the
partitions across the machines which I don't want.

This gives me good scalability as the number of topics increase.  My
overall throughput however is somewhat diminished because I am forcing
Kafka to fsync lots and lots of files (and I want low latency, another
thing Kafka wasn't exactly designed for) -- but overall I am still managing
to get great numbers that are more than sufficient for my current needs.

You might ask - why UDP?  Well, the standard Kafka protocol uses TCP which
is fine for a long running process that is say monitoring syslog logs and
pumping them into Kafka.  It's not at all appropriate for a short-lived
process - like let's say PHP.  Hmmm.  :)  So for my architecture, it was
necessary to implement a UDP listener that would accept incoming messages ,
collect them, and send them to the Kafka tier.

The advantage to this architecture is that only the UDP listener tier has
to know the exact Kafka cluster topology.  Everyone else just sends off a
fire and forget UDP packet.

Hope this was helpful (and not too confusing).

I intend to write up my use case and some of the solutions we invented in
bringing it to life in the near future.

On Tue, Nov 29, 2011 at 6:01 PM, Florian Leibert <f...@leibert.de> wrote:

> Hi -
> reading through the design doc of Kafka I read that it doesn't support LB:
>
> Currently, there is no built-in load balancing between the producers and
> the brokers in Kafka; in our own usage we publish from a large number of
> heterogeneous machines and so it is desirable that the publisher not need
> any explicit knowledge of the cluster topology. We rely on a hardware load
> balancer to distribute the producer load across multiple brokers. We will
> consider adding this in a future release to allow semantic partitioning of
> messages (i.e. publishing all messages to a particular broker based on some
> id to ensure an ordered stream of updates within that id).
>
> However it says further down:
> In v0.6, we introduced built-in automatic load balancing between the
> producers and the brokers in Kafka; Currently, in our own usage we publish
> from a large number of heterogeneous machines and so it is
>
>
> I assume this is just old documentation however looking through the code
> (on master) I couldn't find any unit tests for load balancing so I am
> actually not sure it is supported... Has anyone used it, should it exist?
>
> Thanks
>
> Florian
>

Reply via email to