Yes, the footprint of a topic is one directory per partition (a topic can
have many partitions). Each directory contains one or more segment files
(depending on how much data you are retaining and the segment size, both
configurable).
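
For concreteness, a broker's data directory might look roughly like this
(the log directory, topic names, and file names here are made up for
illustration; segment files are named by the offset at which they begin):

    /kafka-logs/
      metrics-host1-0/                  <- one directory per topic-partition
        00000000000000000000.kafka      <- segment file, rolled at the
        00000000000536870912.kafka         configured segment size
      metrics-host2-0/
        00000000000000000000.kafka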

In addition to having lots of open files, which certainly scales up to the
hundreds of thousands, this will also impact the I/O pattern. As the number
of files increases, the data written to each file necessarily decreases.
This likely means lots of random I/O. The OS can group together writes, but
if you are only doing a single write per topic every now and then there
will be nothing to group and you will get lots of small random I/O. This
will definitely impact throughput. I don't know where the practical limits
are; we have tested up to ~500 topics and seen reasonable performance. We
have not done serious performance testing with tens of thousands of topics
or more.
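
To make the random I/O point concrete, here is a back-of-envelope sketch
in Java (the topic count comes from scenario #1 below; the message size is
an assumption, not a measurement):

    public class SmallWriteSketch {
        public static void main(String[] args) {
            long topics = 300000L;            // scenario #1: several hundred thousand topics
            double msgsPerTopicPerSec = 0.1;  // one message every 10 seconds
            int msgBytes = 50;                // assumed size of a timestamp+value message

            double appendsPerSec = topics * msgsPerTopicPerSec; // ~30,000 appends/sec
            double mbPerSec = appendsPerSec * msgBytes / 1e6;   // only ~1.5 MB/s of data,
            // but spread across 300,000 files, so the OS has almost nothing
            // adjacent to coalesce: nearly every append is a separate seek.
            System.out.printf("~%.0f appends/s, ~%.1f MB/s across %d files%n",
                    appendsPerSec, mbPerSec, topics);
        }
    }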

In addition to the filesystem concerns, there is metadata kept for each
partition in ZooKeeper, and I believe ZooKeeper keeps this metadata in
memory.
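
As a rough illustration, the per-topic state lives under ZooKeeper paths
along these lines (the exact layout depends on the Kafka version, and the
topic name here is made up):

    /brokers/ids/0                    <- broker registration
    /brokers/topics/metrics-host1/0   <- per-topic node under each broker id;
                                         ZooKeeper holds one such node (plus
                                         its bookkeeping) per topic in memory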

-Jay

On Wed, Oct 10, 2012 at 4:12 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Ok,
>
> Perhaps for the sake of argument, consider the question assuming we have
> just 1 Kafka broker.  It sounds like it will need to keep a file handle
> open for each topic?  Is that right?
>
> Jason
>
> On Wed, Oct 10, 2012 at 4:05 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Hi Jason,
> >
> > We use option #2 at LinkedIn for metrics and tracking data. Supporting
> > option #1 in Kafka 0.7 has its challenges since every topic is stored
> > on every broker, by design. Hence, the number of topics a cluster can
> > support is limited by the I/O capacity and the number of open file
> > handles on each broker. Once Kafka 0.8 is released, the distribution
> > of topics to brokers will be user defined and can scale out with the
> > number of brokers. Having said that, some Kafka users have successfully
> > deployed Kafka 0.7 clusters hosting a very high number of topics. I
> > hope they can share their experiences here.
> >
> > Thanks,
> > Neha
> >
> > On Wed, Oct 10, 2012 at 3:57 PM, Jason Rosenberg <j...@squareup.com> wrote:
> > > Hi,
> > >
> > > I'm exploring using kafka for the first time.
> > >
> > > I'm contemplating a system where we transmit metric data to Kafka at
> > > regular intervals.  One question I have is whether to generate simple
> > > messages with very little metadata (just timestamp and value), keeping
> > > metadata like the name/host/app that generated the metric out of the
> > > message and embodying it in the name of the topic itself instead.
> > > Alternatively, we could have a relatively small number of topics
> > > containing messages that include the source metadata along with the
> > > timestamp and metric value.
> > >
> > > 1. On one hand, we'd have a large number of topics (say several
> > > hundred thousand) with small messages, generated at a steady rate (say
> > > one every 10 seconds per topic).
> > >
> > > 2. Alternatively, we could have just a few topics, which receive
> > > several hundred thousand messages every 10 seconds and contain 2 or 3
> > > times more data per message.
> > >
> > > I'm wondering if Kafka has any performance characteristics that
> > > differ between the two scenarios.
> > >
> > > I like #1 because it simplifies targeted message consumption and
> > > enables more interesting use of TopicFilter'ing.  But I'm unsure
> > > whether there might be performance concerns with Kafka (does it have
> > > to do more work to separately manage each topic?).  Is this a common
> > > use case, or not?
> > >
> > > Thanks for any insight.
> > >
> > > Jason
> >
>
