A few things I've learned:

1) Don't break things up into separate topics unless the data in them is truly independent. Consumer behavior can be extremely variable; don't assume you will always be consuming as fast as you are producing.
2) Keep time-related messages in the same partition. Again, consumer behavior can (and will) be extremely variable; don't assume the lag on all your partitions will be similar. Design a partitioning scheme so that the owner of one partition can stop consuming for a long period of time and your application will be minimally impacted (for example, partitioning by transaction id).

On Fri, May 23, 2014 at 1:12 PM, Joel Koshy <[email protected]> wrote:
> Take a look at:
>
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ#FAQ-HowdoIchoosethenumberofpartitionsforatopic
> ?
>
> On Fri, May 23, 2014 at 12:49:39PM -0700, Bhavesh Mistry wrote:
> > Hi Kafka Users,
> >
> > We are trying to transport 4 TB of data per day on a single topic. It is
> > operational application logs. How do we estimate the number of partitions
> > and the partitioning strategy? Our goal is to drain (from the consumer
> > side) from the Kafka brokers as soon as messages arrive (keep the lag as
> > low as possible), and we would also like to distribute the logs uniformly
> > across all partitions.
> >
> > Here is our broker HW spec:
> >
> > 3-broker cluster (192 GB RAM, 32 cores each, with SSDs to hold 7 days of
> > data) with 100G NICs
> >
> > Data rate: ~13 GB per minute
> >
> > Is there a formula to compute the optimal number of partitions needed?
> > Also, how do we ensure uniform distribution from the producer side
> > (currently we use counter % numPartitions, which is not a viable solution
> > in a prod env)?
> >
> > Thanks,
> > Bhavesh
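To make point 2 concrete, here is a minimal sketch of key-based partitioning: hashing the transaction id so every message for one transaction deterministically lands in the same partition. The class and method names are illustrative, not from the thread (Kafka's own producer does key-hash partitioning when you supply a message key, so in practice you'd just use the transaction id as the key).

```java
// Sketch: deterministic key-based partitioning so that all messages for
// a given transaction id land in the same partition. This preserves
// per-transaction ordering, and a lagging consumer on one partition only
// delays the transactions hashed to that partition.
public class TransactionPartitioner {
    // Map a transaction id to one of numPartitions partitions.
    public static int partitionFor(String transactionId, int numPartitions) {
        // Mask off the sign bit so the result is non-negative even when
        // hashCode() returns a negative value.
        return (transactionId.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition.
        int p1 = partitionFor("txn-42", 12);
        int p2 = partitionFor("txn-42", 12);
        System.out.println(p1 == p2); // prints "true"
    }
}
```

With enough distinct transaction ids, the hash spreads load roughly uniformly across partitions without the coordination a shared counter would need, which addresses the `counter % numPartitions` concern from the original question.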
