We've worked on this problem at Tagged. In our current implementation, a single Kafka cluster can handle around 20k simultaneous topics. This is on a RHEL 5 machine with an EXT3-backed filesystem, and the bottleneck is mostly the filesystem itself (because Kafka stores all topics in a single directory).
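To make the bottleneck concrete: EXT3 directory operations degrade as the entry count in one directory grows, which is why the topic hierarchy in our plans below helps. Here is a minimal sketch of the kind of mapping such a hierarchy could use; the class name, fan-out, and paths are made up for illustration and are not our production code:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class TopicPaths {
        // 256 x 256 buckets keeps every directory small even with
        // millions of topics; the constant is illustrative only.
        private static final int FANOUT = 256;

        // Map a topic to logs/<aa>/<bb>/<topic> instead of logs/<topic>,
        // so no single directory has to hold every topic.
        static Path logDir(Path logRoot, String topic) {
            int h = topic.hashCode();
            int first = Math.abs(h % FANOUT);
            int second = Math.abs((h / FANOUT) % FANOUT);
            return logRoot.resolve(
                String.format("%02x/%02x/%s", first, second, topic));
        }

        public static void main(String[] args) {
            // Prints something like /var/kafka/logs/1d/4e/user-12345-events
            System.out.println(
                logDir(Paths.get("/var/kafka/logs"), "user-12345-events"));
        }
    }

With two levels of 256 buckets, a million topics works out to roughly 15 entries per leaf directory, comfortably inside what EXT3 handles well.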
We have implemented a fast cleaner that deletes topics that are no longer used. This cleaner really wipes the topic, instead of leaving a file/directory around as the normal cleaner does. Since this resets the topic offset back to 0, you have to make sure your clients can handle it (in theory your clients need to handle this situation anyway, since it is a normal part of Kafka should the topic offset wrap around the 64-bit value, though in practice that probably never happens). There is a sketch of this kind of client-side handling after the quoted message below.

The other thing of note is that with this many topics, having zookeeper turned on makes startup intolerably slow, because Kafka goes through a validation process. We have therefore turned zookeeper off. To handle a high number of topics in this configuration, we have written custom sharding/routing on top of Kafka (also sketched below). The layers look like this:

+----------------------------------------------------------------------------+
| Pub/sub (implements custom sharding and shields from specifics of Kafka)    |
+----------------------------------------------------------------------------+
|                                   Kafka                                      |
+----------------------------------------------------------------------------+

Currently we are handling on the order of 100k concurrent topics, and we plan to 10x that in the near future. To do so we are planning to implement the following:

- SSD-backed disks (vs. the current eSATA-backed machines)
- a topic hierarchy (to alleviate the 20k filesystem bottleneck)

Note that we will continue with our current sharded solution, because adding Zookeeper back in would likely introduce a lot of bottlenecks beyond startup (all topics are stored in a flat hierarchy in Zookeeper too, so it likely would not do well at this scale).

Hope that helps!

On Mon, Apr 9, 2012 at 11:04 AM, Eric Tschetter <eched...@gmail.com> wrote:
> Hi guys,
>
> I'm wondering about experiences with a large number of feeds created
> and managed on a single Kafka cluster. Specifically, if anyone can
> share information about how many different feeds they have on their
> kafka cluster and overall throughput, that'd be cool.
>
> Some background: I'm planning on setting up a system around Kafka that
> will (hopefully, eventually) have >10,000 feeds in parallel. I expect
> event volume on these feeds to follow a Zipfian distribution, so there
> will be a long tail of smaller feeds and some large ones, but there
> will be consumers for each of these feeds. I'm trying to decide
> between relying on Kafka's feeds to maintain the separation between
> the data streams, or instead creating one large aggregate feed and
> using Kafka's partitioning mechanisms along with some custom logic to
> keep the feeds separated. I would prefer to use Kafka's built-in feed
> mechanisms, because there are significant benefits to that, but I can
> also imagine a world where that many feeds was not in the base
> assumptions of how the system would be used, making performance
> questionable.
>
> Any input is appreciated.
>
> --Eric
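As promised above, a minimal sketch of the recovery logic clients need when the fast cleaner wipes a topic back to offset 0. FetchClient, the exception type, and the offset bookkeeping are hypothetical stand-ins for whatever consumer API is in use, not our actual client:

    import java.util.List;

    /** Sketch: a consumer that tolerates a topic being wiped to offset 0. */
    public class WipeTolerantConsumer {
        // Hypothetical stand-ins for a real consumer API.
        interface FetchClient {
            List<byte[]> fetch(String topic, long offset)
                throws OffsetOutOfRangeException;
        }
        static class OffsetOutOfRangeException extends Exception {}

        private final FetchClient client;
        private long offset;

        WipeTolerantConsumer(FetchClient client, long startOffset) {
            this.client = client;
            this.offset = startOffset;
        }

        void pollOnce(String topic) {
            try {
                List<byte[]> batch = client.fetch(topic, offset);
                offset += batch.size();  // advance past what we consumed
            } catch (OffsetOutOfRangeException e) {
                // The topic was wiped (or, in theory, the 64-bit offset
                // wrapped): our saved offset no longer exists, so restart
                // from the beginning.
                offset = 0L;
            }
        }
    }

The same path covers the theoretical wrap-around case, since both look like "my saved offset no longer exists" from the client's point of view.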
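And a minimal sketch of the routing idea in the pub/sub layer: deterministically hash each topic to one of N Kafka clusters, so no single broker's filesystem sees more than its share of topics. ClusterClient and the CRC32-based shard choice are illustrative assumptions; the real layer does more than this (shielding clients from Kafka specifics, among other things):

    import java.nio.charset.StandardCharsets;
    import java.util.List;
    import java.util.zip.CRC32;

    /** Sketch: shard logical topics across several Kafka clusters. */
    public class ShardedPubSub {
        // Hypothetical handle to one cluster's producer connections.
        interface ClusterClient {
            void publish(String topic, byte[] message);
        }

        private final List<ClusterClient> clusters; // one per Kafka cluster

        ShardedPubSub(List<ClusterClient> clusters) {
            this.clusters = clusters;
        }

        // Deterministic topic -> cluster mapping.
        int shardFor(String topic) {
            CRC32 crc = new CRC32();
            crc.update(topic.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % clusters.size());
        }

        public void publish(String topic, byte[] message) {
            clusters.get(shardFor(topic)).publish(topic, message);
        }
    }

Because the mapping is a pure function of the topic name, producers and consumers agree on a topic's cluster without any coordination service.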