Ahmed,

The consumer rebalancing design is described here, towards the end of
the page - http://incubator.apache.org/kafka/design.html

Please let us know if it is unclear and doesn't answer your questions.

Thanks,
Neha

On Wed, May 2, 2012 at 10:55 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> Neha,
>
> Why does this repartition occur? Is it that when a particular topic
> reaches a certain size or # of messages, it re-balances?
>
> If I don't care about re-partitioning, I can just write my consuming
> code such that if the userid is the same, it aggregates on that key;
> if it's a new key, it creates a new entry in the dictionary (assuming
> I use a dictionary where the key is the userId and the value is the
> aggregation of the messages).
>
> I was just aiming to be more efficient than just reading random
> messages.
>
> On Wed, May 2, 2012 at 12:31 PM, Neha Narkhede
> <neha.narkh...@gmail.com> wrote:
>
> > Ahmed,
> >
> > Your use case sounds similar to what Peter mentioned in another
> > thread -
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
> >
> > On the producer side, you can use a Partitioner to partition the
> > Kafka messages by userid. This would ensure that data from a
> > particular user always ends up in the same partition [1].
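> >
> > A minimal sketch of such a partitioner, assuming the 0.7
> > kafka.producer.Partitioner interface (please double-check the exact
> > class and signature against the version you are running):
> >
> >   import kafka.producer.Partitioner;
> >
> >   public class UserIdPartitioner implements Partitioner<String> {
> >       // Hash the userid to a fixed partition so that one user's
> >       // clicks always land in the same partition.
> >       public int partition(String userId, int numPartitions) {
> >           // Mask the sign bit rather than using Math.abs(), which
> >           // returns a negative value for Integer.MIN_VALUE.
> >           return (userId.hashCode() & 0x7fffffff) % numPartitions;
> >       }
> >   }
> >
> > You would then register it through the partitioner.class property
> > of your ProducerConfig and send keyed messages, so that the userid
> > key reaches the partitioner.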
> >
> > On the consumer side, you can imagine the application that does the
> > user click counting to be one Kafka consumer group. In steady state,
> > a partition will always be consumed by only one of the consumers in
> > this group, so you could maintain some cache to hold the user click
> > counts. However, when rebalancing happens, the partition could be
> > consumed by another consumer. So, right before the rebalancing
> > operation, you would want to flush your userid counts so they can be
> > picked up by the next consumer that consumes data from that user's
> > partition.
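> >
> > As a rough sketch of that cache (hypothetical names, plain Java,
> > not a specific Kafka API):
> >
> >   import java.util.HashMap;
> >   import java.util.Map;
> >
> >   public class ClickCounts {
> >       private final Map<String, Long> counts =
> >               new HashMap<String, Long>();
> >
> >       // Aggregate one click in memory, keyed by userid.
> >       public void record(String userId) {
> >           Long current = counts.get(userId);
> >           counts.put(userId, current == null ? 1L : current + 1);
> >       }
> >
> >       // Call right before a rebalance (and periodically, as a
> >       // safety net) so the rolled-up counts survive the partition
> >       // moving to another consumer.
> >       public void flush() {
> >           for (Map.Entry<String, Long> entry : counts.entrySet()) {
> >               writeToMysql(entry.getKey(), entry.getValue());
> >           }
> >           counts.clear();
> >       }
> >
> >       // e.g. a JDBC upsert that increments the stored count;
> >       // omitted for brevity.
> >       private void writeToMysql(String userId, long delta) {
> >       }
> >   }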
> >
> > Thanks,
> > Neha
> >
> > 1. Note that the producer-side sticky partitioning guarantees are
> > not ideal in Kafka 0.7. This is because when brokers are bounced,
> > partitions can become unavailable for some time, and during this
> > time a user's data can be routed to another partition. However, in
> > Kafka 0.8 we are working to add intra-cluster replication, which
> > would guarantee the availability of a partition even in the presence
> > of broker failures.
> >
> >
> > On Wed, May 2, 2012 at 9:05 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> >
> > > Trying to understand how Kafka could be used in the following
> > > scenario:
> > >
> > > Say I am creating a SaaS application for website click tracking. A
> > > client would paste some javascript on their website, and any link
> > > clicked on their website would result in an API call that would
> > > log the click (ip address, link metadata, timestamp, session guid,
> > > etc.).
> > >
> > > Since these API calls are coming from remote servers, I'm guessing
> > > I would be wrapping the calls to Kafka via an HTTP server, e.g. a
> > > jetty servlet handler would take the HTTP call made via the API
> > > and then write to a kafka topic (rough sketch at the end of this
> > > mail).
> > >
> > > Am I right so far?
> > >
> > > Now how could I partition the data in a way that would make
> > > consuming more efficient? I.e. I am tracking click counts for
> > > visitors to a website, and it is probable that a user will have
> > > multiple messages written to kafka in a given session, so on the
> > > consumer end, if I could read in batches and aggregate before I
> > > write the 'rolled up' data to mysql, that would be ideal.
> > >
> > > I read the kafka design page, and I understand at a high level
> > > that consumers can be 'grouped'.
> > >
> > > Looking for someone to clarify how this use case could be solved
> > > with kafka, particularly how partitioning and consumption work
> > > (still not 100% clear on those, and hopefully this sample use case
> > > will clear that up).
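> > >
> > > For what it's worth, the ingestion path I'm picturing is roughly
> > > the following (an untested sketch; the Kafka classes are my guess
> > > at the 0.7 producer API, so the names may be off):
> > >
> > >   import java.io.IOException;
> > >   import java.util.Collections;
> > >   import java.util.Properties;
> > >   import javax.servlet.http.HttpServlet;
> > >   import javax.servlet.http.HttpServletRequest;
> > >   import javax.servlet.http.HttpServletResponse;
> > >   import kafka.javaapi.producer.Producer;
> > >   import kafka.javaapi.producer.ProducerData;
> > >   import kafka.producer.ProducerConfig;
> > >
> > >   public class ClickServlet extends HttpServlet {
> > >       private Producer<String, String> producer;
> > >
> > >       public void init() {
> > >           Properties props = new Properties();
> > >           props.put("zk.connect", "localhost:2181");
> > >           props.put("serializer.class",
> > >                   "kafka.serializer.StringEncoder");
> > >           producer = new Producer<String, String>(
> > >                   new ProducerConfig(props));
> > >       }
> > >
> > >       protected void doGet(HttpServletRequest req,
> > >               HttpServletResponse resp) throws IOException {
> > >           // userid as the key (so a partitioner can route it),
> > >           // click details as the payload
> > >           String userId = req.getParameter("userid");
> > >           String click = req.getRemoteAddr() + ","
> > >                   + req.getParameter("link") + ","
> > >                   + System.currentTimeMillis();
> > >           producer.send(new ProducerData<String, String>("clicks",
> > >                   userId, Collections.singletonList(click)));
> > >           resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
> > >       }
> > >   }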