Taylor,

Thanks for the detailed response.  I'd be interested to know why you
thought it important to support a large number of topics (as I'm
contemplating as well).  What was the value proposition for that?  It
seems like you went to great lengths to make it work.

I'm curious: you mention in several places a concern about the threshold
for flushing data to disk, and you seem to be saying that flushing sooner
rather than later is desirable.  Why is this?  I would have thought that
keeping things in memory longer would be more efficient.

Thanks,

Jason


On Wed, Oct 10, 2012 at 8:13 PM, Taylor Gautier <tgaut...@gmail.com> wrote:

> We used pattern #1 at Tagged.  I wouldn't recommend it unless you're really
> committed.  It took a lot of work to get it working right.
>
> a) Performance degraded non-linearly (read: it fell off a cliff) when
> brokers were managing more than about 20k topics.  This was on a Linux
> RHEL 5.3 system with EXT3.  YMMV.
>
> b) Startup time is significantly longer for a broker that is restarted due
> to communication with ZK to sync up on those topics.
>
> c) If topics are short lived, even if Kafka expires the data segments
> using its standard 0.7 cleaner, the directory for the topic will still
> exist on disk, and the topic is still considered "active" (in memory) in
> Kafka.  This causes problems - see (a) above (open file handles).
>
> d) Message latency is affected.  Kafka syncs messages to disk if x messages
> have buffered in memory, or y seconds have elapsed (both configurable).  If
> you have few topics and many messages (pattern #2), you will be hitting the
> x limit quite often, and get good throughput.  However, if you have many
> topics and few messages per topic (pattern #1), you will have to rely on
> the y threshold to flush to disk, and setting this too low can impact
> performance (throughput) in a significant way.  Jay already mentioned this
> as random writes.
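>
> To make the trade-off concrete, the per-log decision is essentially
> "flush if x messages have accumulated or y ms have passed since the last
> flush".  A simplified model (illustrative only, not Kafka's actual code;
> any thresholds are made up):
>
>   // Flush when either the message-count threshold (x) or the time
>   // threshold (y) is reached.
>   class FlushPolicy {
>       private final int maxMessages;     // the "x" threshold
>       private final long maxIntervalMs;  // the "y" threshold
>       private int buffered = 0;
>       private long lastFlushMs = System.currentTimeMillis();
>
>       FlushPolicy(int maxMessages, long maxIntervalMs) {
>           this.maxMessages = maxMessages;
>           this.maxIntervalMs = maxIntervalMs;
>       }
>
>       boolean shouldFlush(long nowMs) {
>           return buffered >= maxMessages
>               || (nowMs - lastFlushMs) >= maxIntervalMs;
>       }
>
>       void onAppend()          { buffered++; }
>       void onFlush(long nowMs) { buffered = 0; lastFlushMs = nowMs; }
>   }
>
> With many topics and few messages per topic, "buffered" almost never
> reaches the x threshold, so nearly every flush is triggered by y: one
> small write per topic per interval.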
>
> We had to implement a number of solutions ourselves to resolve these
> issues, namely:
>
> #1 No ZK.  This means that none of the automatic partitioning done by
> Kafka is available, so we had to roll our own (luckily Tagged is pretty
> used to scaling systems, so there was plenty of in-house expertise).  The
> solution was to implement a R/W proxy layer of machines that intercepts
> messages and reads/writes to/from Kafka, handling the sharding at the
> proxy layer.  Because most of our messages were coming from PHP and we
> didn't want to use TCP, we needed a UDP/TCP bridge/proxy anyway, so this
> wasn't a huge deal.  (Also, we wanted strict ordering of messages, so we
> needed a shard-by-topic feature anyway; I believe this can be done in
> 0.7, but we started with 0.6.)
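>
> Roughly, the shard-by-topic routing at the proxy looked something like
> this (an illustrative sketch only, not our actual code; broker addresses
> and class names are placeholders):
>
>   import java.util.List;
>
>   // Every message for a given topic always maps to the same broker,
>   // which is what preserves per-topic ordering.
>   class TopicShardRouter {
>       private final List<String> brokers;  // e.g. "kafka1:9092", "kafka2:9092", ...
>
>       TopicShardRouter(List<String> brokers) { this.brokers = brokers; }
>
>       String brokerFor(String topic) {
>           int idx = (topic.hashCode() & 0x7fffffff) % brokers.size();
>           return brokers.get(idx);
>       }
>   }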
>
> #2 Custom cleaner.  We implemented an extra cleanup task inside the Kafka
> process that could completely remove a topic from memory and disk.  For
> clients, this sometimes meant that a subscribed topic suddenly changed its
> physical offset from some offset X back to 0.  That's OK: even though it
> would probably never happen in practice, clients should handle this case
> anyway, because the Kafka physical message space is limited to 64 bits
> (again, unlikely to ever wrap in practice, but you never know).  It's
> pretty easy to handle: just catch the "invalid offset" error Kafka gives
> and start at 0.
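>
> The client-side handling is basically this shape (a sketch with a
> hypothetical fetch interface and exception name, not the actual 0.6/0.7
> consumer API):
>
>   // If the broker reports that our offset is invalid (because the topic
>   // was cleaned and recreated), restart from offset 0.
>   class ResettingConsumer {
>       // Hypothetical stand-in for the broker's "invalid offset" error.
>       static class InvalidOffsetException extends RuntimeException {}
>
>       // Hypothetical fetch abstraction; returns the next offset to request.
>       interface Fetcher { long fetch(String topic, long offset); }
>
>       static void consume(Fetcher fetcher, String topic, long startOffset) {
>           long offset = startOffset;
>           while (true) {
>               try {
>                   offset = fetcher.fetch(topic, offset);
>               } catch (InvalidOffsetException e) {
>                   offset = 0L;  // topic was recycled: start from the beginning
>               }
>           }
>       }
>   }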
>
> #3 Low threshold for flush.  This gave us good latency, but poor throughput
> (relatively speaking).  We had more than enough throughput, but it was
> nowhere near what Kafka can do when set up in pattern #1.
>
> Given that you want to manage "hundreds of thousands of topics", that may
> mean you would need tens of Kafka brokers, which could be another source
> of problems - it's more cost, more management, and more sources of
> failure.  SSDs may help solve this problem, btw, but then you are talking
> about expensive machines rather than off-the-shelf cheapo servers with
> standard SATA drives.
>
> On Wed, Oct 10, 2012 at 4:25 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
>
> > Yes, the footprint of a topic is one directory per partition (a topic
> > can have many partitions).  Each directory contains one or more files
> > (depending on how much data you are retaining and the segment size, both
> > configurable).
> >
> > In addition to having lots of open files, which certainly scales up to
> > the hundreds of thousands, this will also impact the I/O pattern.  As the
> > number of files increases, the data written to each file necessarily
> > decreases.  This likely means lots of random I/O.  The OS can group
> > together writes, but if you are only doing a single write per topic every
> > now and then, there will be nothing to group and you will get lots of
> > small random I/O.  This will definitely impact throughput.  I don't know
> > where the practical limits are; we have tested up to ~500 topics and seen
> > reasonable performance.  We have not done serious performance testing
> > with tens of thousands of topics or more.
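> >
> > To put rough numbers on the scenario described below (the per-message
> > size is a guess): ~300,000 topics x 1 message / 10 sec each is ~30,000
> > messages/sec cluster-wide, which at ~100 bytes/message is only ~3 MB/sec
> > of data.  But spread across ~300,000 log files that is roughly 10
> > bytes/sec per file, so almost every flush is a tiny write to a different
> > file, i.e. nearly pure random I/O.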
> >
> > In addition to the filesystem concerns there is metadata kept for each
> > partition in zk, and I believe zk keeps this metadata in memory.
> >
> > -Jay
> >
> > On Wed, Oct 10, 2012 at 4:12 PM, Jason Rosenberg <j...@squareup.com>
> > wrote:
> >
> > > Ok,
> > >
> > > Perhaps, for the sake of argument, consider the case where we have
> > > just one Kafka broker.  It sounds like it will need to keep a file
> > > handle open for each topic?  Is that right?
> > >
> > > Jason
> > >
> > > On Wed, Oct 10, 2012 at 4:05 PM, Neha Narkhede <neha.narkh...@gmail.com>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > We use option #2 at LinkedIn for metrics and tracking data.
> > > > Supporting option #1 in Kafka 0.7 has its challenges, since every
> > > > topic is stored on every broker, by design.  Hence, the number of
> > > > topics a cluster can support is limited by the IO and the number of
> > > > open file handles on each broker.  After Kafka 0.8 is released, the
> > > > distribution of topics to brokers is user defined and can scale out
> > > > with the number of brokers.  Having said that, some Kafka users have
> > > > successfully deployed Kafka 0.7 clusters hosting a very high number
> > > > of topics.  I hope they can share their experiences here.
> > > >
> > > > Thanks,
> > > > Neha
> > > >
> > > > On Wed, Oct 10, 2012 at 3:57 PM, Jason Rosenberg <j...@squareup.com>
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > I'm exploring using kafka for the first time.
> > > > >
> > > > > I'm contemplating a system where we transmit metric data to Kafka
> > > > > at regular intervals.  One question I have is whether to generate
> > > > > simple messages with very little metadata (just timestamp and
> > > > > value), keeping metadata like the name/host/app that generated the
> > > > > metric out of the message and embodying it in the name of the topic
> > > > > itself instead.  Alternatively, we could have a relatively small
> > > > > number of topics containing messages that include the source
> > > > > metadata along with the timestamp and metric value.
> > > > >
> > > > > 1. On one hand, we'd have a large number of topics (say several
> > > > > hundred thousand) with small messages, generated at a steady rate
> > > > > (say one every 10 seconds per topic).
> > > > >
> > > > > 2. Alternatively, we could have just a few topics, which receive
> > > > > several hundred thousand messages every 10 seconds and contain 2 or
> > > > > 3 times more data per message.
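> > > > >
> > > > > As a rough illustration of the two message shapes (topic and field
> > > > > names are made up):
> > > > >
> > > > >   #1: topic "metrics.host42.webapp.cpu_idle"
> > > > >       message {"ts": 1349910000, "value": 0.93}
> > > > >   #2: topic "metrics"
> > > > >       message {"ts": 1349910000, "host": "host42", "app": "webapp",
> > > > >                "metric": "cpu_idle", "value": 0.93}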
> > > > >
> > > > > I'm wondering if Kafka has any performance characteristics that
> > > > > differ between the two scenarios.
> > > > >
> > > > > I like #1 because it simplifies targeted message consumption and
> > > > > enables more interesting use of TopicFilter'ing.  But I'm unsure
> > > > > whether there might be performance concerns with Kafka (does it have
> > > > > to do more work to manage each topic separately?).  Is this a common
> > > > > use case, or not?
> > > > >
> > > > > Thanks for any insight.
> > > > >
> > > > > Jason
> > > >
> > >
> >
>
