Neha, why does this repartitioning occur? Is it that when a particular topic reaches a certain size or # of messages, it re-balances?
If I don't care about re-partitioning, I can just write my consuming code
such that if the userid is the same, I aggregate on that key, and if it's a
new key, I create a new entry in the dictionary (assuming I use a
dictionary where the key is the userId and the value is the aggregation of
the messages; see the sketch at the bottom of this mail). I was just aiming
to be more efficient than just reading random messages.

On Wed, May 2, 2012 at 12:31 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Ahmed,
>
> Your use case sounds similar to what Peter mentioned in another thread -
>
> http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
>
> On the producer side, you can use a Partitioner to partition the Kafka
> messages by userid. This would ensure that data from a particular user
> always ends up in the same partition [1].
>
> On the consumer side, you can imagine the application that does the user
> click counting to be one Kafka consumer group. In steady state, one
> partition will always be consumed by only one of the consumers in this
> group. So you could maintain some cache to hold user click counts.
> However, when rebalancing happens, the partition could be consumed by
> another consumer. So, right before the rebalancing operation, you would
> want to flush your userid counts, so they can be picked up by the next
> consumer that would consume data from that user's partition.
>
> Thanks,
> Neha
>
> 1. Note that the producer-side sticky partitioning guarantees are not
> ideal in Kafka 0.7. This is because when brokers are bounced, partitions
> can become unavailable for some time. During this time, the user's data
> can be routed to another partition. However, with Kafka 0.8, we are
> working to add intra-cluster replication that would guarantee the
> availability of a partition even in the presence of broker failures.
>
>
> On Wed, May 2, 2012 at 9:05 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>
> > Trying to understand how Kafka could be used in the following scenario:
> >
> > Say I am creating a SaaS application for website click tracking. So a
> > client would paste some javascript on their website, and any link
> > clicked on their website would result in an api call that would log
> > the click (ip address, link metadata, timestamp, session guid, etc).
> >
> > Since these api calls are coming from remote servers, I'm guessing I
> > would be wrapping the calls to kafka via an http server, e.g. a jetty
> > servlet handler would take the http call made via the api and then
> > write to a kafka topic.
> >
> > Am I right so far?
> >
> > Now how could I partition the data in a way that would make consuming
> > more efficient?
> > i.e. I am tracking click counts for visitors to a website; it is
> > probable that a user will have multiple messages written to kafka in a
> > given session, so on the consumer end if I could read in batches and
> > aggregate before I write the 'rolled up' data to mysql, that would be
> > ideal.
> >
> > I read the kafka design page, and I understand at a high level that
> > consumers can be 'grouped'.
> >
> > Looking for someone to clarify how this use case could be solved with
> > kafka, particularly how partitioning and consumption works (still not
> > 100% clear on those and hopefully this sample use case will clear that
> > up).
> >
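
For the consumer-side aggregation I described above, something like this
rough sketch is what I had in mind (plain Java, independent of the consumer
API; ClickAggregator, onMessage and flushToMysql are names I made up for
illustration):

import java.util.HashMap;
import java.util.Map;

// Rough sketch of the per-user rollup on the consumer side.
public class ClickAggregator {

    // key = userId, value = rolled-up click count for the current batch
    private final Map<String, Long> counts = new HashMap<String, Long>();

    // Same key -> aggregate on it; new key -> new dictionary entry.
    public void onMessage(String userId) {
        Long current = counts.get(userId);
        counts.put(userId, current == null ? 1L : current + 1L);
    }

    // Called periodically, and right before a rebalance, so the next
    // consumer that picks up the partition starts from a clean slate
    // (per Neha's note above).
    public void flushToMysql() {
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            // write entry.getKey() -> entry.getValue() to mysql here
        }
        counts.clear();
    }
}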
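
And on the producer side, my reading of Neha's Partitioner suggestion is
something along these lines (a sketch against what I believe is the 0.7
kafka.producer.Partitioner interface, wired up via the partitioner.class
producer property; please correct me if the signature is off):

import kafka.producer.Partitioner;

// Sketch: send every message for a given userId to the same partition,
// so one consumer in the group sees all of that user's clicks
// (modulo the rebalancing caveat Neha described).
public class UserIdPartitioner implements Partitioner<String> {
    public int partition(String userId, int numPartitions) {
        // Mask the sign bit rather than using Math.abs, which can
        // still return a negative value for Integer.MIN_VALUE.
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }
}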
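
For completeness, the jetty front end from my original mail would then boil
down to a handler that hands each click off to a producer, roughly like
this (a sketch assuming the 0.7 javaapi producer; the property values, the
ProducerData constructor and the "clicks" topic name are my assumptions, so
double-check them against the docs):

import java.util.Collections;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.javaapi.producer.ProducerData;
import kafka.producer.ProducerConfig;

// Sketch: the piece a jetty servlet handler would call to forward one
// click event to Kafka, keyed by userId so the partitioner above applies.
public class ClickProducer {

    private final Producer<String, String> producer;

    public ClickProducer() {
        Properties props = new Properties();
        props.put("zk.connect", "localhost:2181");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("partitioner.class", "UserIdPartitioner");
        producer = new Producer<String, String>(new ProducerConfig(props));
    }

    public void logClick(String userId, String clickJson) {
        producer.send(new ProducerData<String, String>(
                "clicks", userId, Collections.singletonList(clickJson)));
    }
}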