Ahmed,

Your use case sounds similar to what Peter mentioned in another thread:
http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
On the producer side, you can use a Partitioner to partition the Kafka
messages by userid. This would ensure that data from a particular user
always ends up in the same partition [1].

On the consumer side, you can imagine the application that does the user
click counting to be one Kafka consumer group. In steady state, one
partition will always be consumed by only one of the consumers in this
group, so you could maintain a cache of user click counts. However, when a
rebalance happens, the partition could be consumed by another consumer. So,
right before the rebalancing operation, you would want to flush your userid
counts so they can be picked up by the next consumer that consumes data
from that user's partition.

Thanks,
Neha

1. Note that the producer-side sticky partitioning guarantees are not ideal
in Kafka 0.7. This is because when brokers are bounced, partitions can
become unavailable for some time, and during that time a user's data can be
routed to another partition. With Kafka 0.8, however, we are working to add
intra-cluster replication that would guarantee the availability of a
partition even in the presence of broker failures.

On Wed, May 2, 2012 at 9:05 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> Trying to understand how Kafka could be used in the following scenario:
>
> Say I am creating a SaaS application for website click tracking. A client
> would paste some JavaScript on their website, and any link clicked on the
> site would result in an API call that logs the click (IP address, link
> metadata, timestamp, session GUID, etc.).
>
> Since these API calls are coming from remote servers, I'm guessing I
> would be wrapping the calls to Kafka via an HTTP server, e.g. a Jetty
> servlet handler would take the HTTP call made via the API and then write
> to a Kafka topic.
>
> Am I right so far?
>
> Now how could I partition the data in a way that would make consuming
> more efficient? i.e.
> I am tracking click counts for visitors to a website, so it is probable
> that a user will have multiple messages written to Kafka in a given
> session. On the consumer end, if I could read in batches and aggregate
> before writing the 'rolled up' data to MySQL, that would be ideal.
>
> I read the Kafka design page, and I understand at a high level that
> consumers can be 'grouped'.
>
> Looking for someone to clarify how this use case could be solved with
> Kafka, particularly how partitioning and consumption work (still not 100%
> clear on those, and hopefully this sample use case will clear that up).
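Neha's two suggestions above can be sketched roughly as follows. This is a
hypothetical, Kafka-free illustration: partition() mirrors the shape of a
userid-sticky partitioner (hash the userid, mod the partition count), and
flush() stands in for persisting the rolled-up counts right before a
rebalance hands the partition to another consumer. None of the names here
are the real Kafka 0.7 API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Kafka 0.7 classes.
public class ClickCounting {

    // Producer side: route every message for a given userid to the same
    // partition, so a single consumer sees all of that user's clicks.
    public static int partition(String userId, int numPartitions) {
        // Mask off the sign bit so the modulus is never negative.
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }

    // Consumer side: in-memory click counts per userid.
    private final Map<String, Long> counts = new HashMap<>();

    public void onClick(String userId) {
        counts.merge(userId, 1L, Long::sum);
    }

    // Called right before a rebalance: hand back the rolled-up counts
    // (e.g. to be written to MySQL) and clear the cache so the next
    // owner of the partition starts fresh.
    public Map<String, Long> flush() {
        Map<String, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }
}
```

Note that after a broker bounce in 0.7 (footnote [1] above), the same userid
could temporarily hash-route to an unavailable partition's fallback, so the
stickiness this sketch relies on is only best-effort until 0.8 replication.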