Also another issue would be duplicate messages, since kafka doesn't guarantee that each message is unique, you would have to somehow coordinate between the consumers if a message has been accounted for or not (which again makes another point for not filtering pre flush).
On Thu, May 17, 2012 at 6:34 PM, Jay Kreps <jay.kr...@gmail.com> wrote: > Yeah so our current recommendation would be to do that as post > processing as a consumer. It can store its results back to another > topic if needed. This gives a clean seperation between the log of > incoming data and the aggregation process. If you co-locate these > things (same machine differrent process) the overhead should be pretty > small. > > -Jay > > On Thu, May 17, 2012 at 2:32 PM, S Ahmed <sahmed1...@gmail.com> wrote: > > Say I am storing messages like this: > > > > sessionID, year-month-day-hour-minute-second, value > > > > Now say I only need to stats at the minute level, or hour level, this > means > > that i could save allot of hard drive space by rolling it up before it > gets > > persisted to disk. > > > > i.e. I could roll up hundreds of messages per sessionId to a single > message. > > > > That's pretty much it, and maybe your right it is mixing things and > others > > might not thing it is useful. > > > > > > On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <jay.kr...@gmail.com> wrote: > > > >> I think there is no inherent reason we couldn't include a > >> "transformation" plug in that runs before data is written. But after > >> some bad experiences I am kind of fundamentally against allowing > >> application code into the infrastructure process. Can you flesh out > >> the use case a little more with some example? Wouldn't doing a > >> post-aggregation and re-publication to another topic work just as > >> well? > >> > >> -Jay > >> > >> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <sahmed1...@gmail.com> wrote: > >> > Oh, maybe this isn't possible again since the object is mapped to a > file, > >> > and it may already have flushed data at the os level? > >> > > >> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <sahmed1...@gmail.com> > wrote: > >> > > >> >> One downside is if my logic was messed up, I don't have a timeframe > of > >> >> rolling the logic back (which was one of the benefits of kafka's > design > >> >> choice of having messages kept around for x days). > >> >> > >> >> > >> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <sahmed1...@gmail.com> > wrote: > >> >> > >> >>> What do you mean? > >> >>> > >> >>> " I think the direction we are going > >> >>> is instead to just let you co-locate this processing on the same > box. > >> >>> This gives the isolation of separate processes and the overhead of > the > >> >>> transfer over localhost is pretty minor. " > >> >>> > >> >>> > >> >>> I see what your saying as it is a specific implemention/use case > that > >> >>> diverts from a general purpose mechanism, that's why I was > suggesting > >> maybe > >> >>> a hook/event based system. > >> >>> > >> >>> > >> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <jay.kr...@gmail.com> > >> wrote: > >> >>> > >> >>>> Yeah I see where you are going with that. We toyed with this idea, > but > >> >>>> the idea of coupling processing to the log storage raises a lot of > >> >>>> problems for general purpose usage. I think the direction we are > going > >> >>>> is instead to just let you co-locate this processing on the same > box. > >> >>>> This gives the isolation of separate processes and the overhead of > the > >> >>>> transfer over localhost is pretty minor. > >> >>>> > >> >>>> -Jay > >> >>>> > >> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <sahmed1...@gmail.com> > >> wrote: > >> >>>> > Would it be possible to filter the collection before it gets > flush > >> to > >> >>>> disk? > >> >>>> > > >> >>>> > Say I am tracking page views per user, and I could perform a > rollup > >> >>>> before > >> >>>> > it gets flushed to disk (using a hashmap with the key being the > >> >>>> sessionId, > >> >>>> > and increment a counter for the duplicate entries). > >> >>>> > > >> >>>> > And could this be done w/o modifying the original source, maybe > >> through > >> >>>> > some sort of event/listener? > >> >>>> > >> >>> > >> >>> > >> >> > >> >