Ahmed,

The consumer rebalancing design is described here, towards the end of the
page - http://incubator.apache.org/kafka/design.html

Please let us know if it is unclear or doesn't answer your questions.

Thanks,
Neha

On Wed, May 2, 2012 at 10:55 AM, S Ahmed <sahmed1...@gmail.com> wrote:

> Neha,
>
> Why does this repartitioning occur?  Is it that when a particular topic
> reaches a certain size or number of messages, it rebalances?
>
> If I don't care about re-partitioning, I can just write my consuming code
> such that IF the userid is the same, aggregate on that key, and if it's a
> new key, create a new entry in the dictionary (assuming I use a dictionary,
> where the key is the userId and the value is the aggregation of the
> messages).
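That dictionary-style aggregation can be sketched in Java roughly as follows (a minimal illustration; the class and method names are hypothetical, and it assumes each consumed message exposes a userId):

```java
import java.util.HashMap;
import java.util.Map;

public class ClickAggregator {
    // userId -> click count accumulated so far (the "dictionary" above)
    private final Map<String, Long> counts = new HashMap<>();

    // Called once per consumed message: aggregate on an existing key,
    // or create a new entry if the userId has not been seen yet.
    public void record(String userId) {
        counts.merge(userId, 1L, Long::sum);
    }

    public long countFor(String userId) {
        return counts.getOrDefault(userId, 0L);
    }
}
```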
>
> I was just aiming to be more efficient than just reading random messages.
>
> On Wed, May 2, 2012 at 12:31 PM, Neha Narkhede <neha.narkh...@gmail.com>
> wrote:
>
> > Ahmed,
> >
> > Your use case sounds similar to what Peter mentioned in another thread -
> >
> >
> > http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201205.mbox/ajax/%3CCAEJzOMYROfvJo6u-qPJ0xLjF69Asod6zowkDKc8PpE2457nWDg%40mail.gmail.com%3E
> >
> > On the producer side, you can use a Partitioner to partition the Kafka
> > messages by userid. This would ensure that data from a particular user
> > always ends up in the same partition[1].
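In practice such a Partitioner usually boils down to hashing the userid modulo the partition count. Shown here as a standalone sketch (in Kafka 0.7 itself you would implement the kafka.producer.Partitioner interface and register your class via the partitioner.class producer property; the class name below is hypothetical):

```java
public class UserIdPartitioner {
    // Maps a userid to a partition index; the same userid always yields
    // the same partition as long as numPartitions stays unchanged.
    public int partition(String userId, int numPartitions) {
        // Mask the sign bit instead of Math.abs, which can return a
        // negative value for Integer.MIN_VALUE.
        return (userId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```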
> >
> > On the consumer side, you can imagine the application that does the user
> > click counting to be one Kafka consumer group. In steady state, one
> > partition will always be consumed by only one of the consumers in this
> > group. So you could maintain some cache to hold user click counts.
> > However, when rebalancing happens, the partition could be consumed by
> > another consumer. So, right before the rebalancing operation, you would
> > want to flush your userid counts, so they can be picked up by the next
> > consumer that would consume data from that user's partition.
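The cache-then-flush pattern described above might look like this (a hypothetical sketch; the class is standalone, and the mechanism that triggers the flush, e.g. a periodic commit or whatever hook your consumer wiring provides, is left out):

```java
import java.util.HashMap;
import java.util.Map;

public class UserClickCache {
    private final Map<String, Long> counts = new HashMap<>();

    public void increment(String userId) {
        counts.merge(userId, 1L, Long::sum);
    }

    // Snapshot the per-user counts for durable storage (returned to the
    // caller here) and clear the cache, so the next consumer that owns
    // this partition starts counting from a clean slate.
    public Map<String, Long> flush() {
        Map<String, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }
}
```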
> >
> > Thanks,
> > Neha
> >
> > 1. Note that the producer-side sticky partitioning guarantees are not
> > ideal in Kafka 0.7. This is because when brokers are bounced, partitions
> > can become unavailable for some time. During this time, the user's data
> > can be routed to another partition. However, with Kafka 0.8, we are
> > working to add intra-cluster replication that would guarantee the
> > availability of a partition even in the presence of broker failures.
> >
> >
> > On Wed, May 2, 2012 at 9:05 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> >
> > > Trying to understand how Kafka could be used in the following scenario:
> > >
> > > Say I am creating a SaaS application for website click tracking.  So a
> > > client would paste some javascript on their website, and any link
> > > clicked on their website would result in an API call that would log
> > > the click (ip address, link metadata, timestamp, session guid, etc).
> > >
> > > Since these API calls are coming from remote servers, I'm guessing I
> > > would be wrapping the calls to Kafka via an HTTP server, e.g. a Jetty
> > > servlet handler would take the HTTP call made via the API and then
> > > write to a Kafka topic.
> > >
> > > Am I right so far?
> > >
> > > Now how could I partition the data in a way that would make consuming
> > > more efficient?
> > > i.e. I am tracking click counts for visitors to a website; it would be
> > > probable that a user will have multiple messages written to Kafka in a
> > > given session, so on the consumer end if I could read in batches and
> > > aggregate before I write the 'rolled up' data to mysql, that would be
> > > ideal.
> > >
> > > I read the Kafka design page, and I understand at a high level that
> > > consumers can be 'grouped'.
> > >
> > > Looking for someone to clarify how this use case could be solved with
> > > Kafka, particularly how partitioning and consumption work (still not
> > > 100% clear on those, and hopefully this sample use case will clear
> > > that up).
> > >
> >
>
