Say I am storing messages like this:

    sessionID, year-month-day-hour-minute-second, value

Now say I only need stats at the minute level, or hour level. This means that I could save a lot of hard drive space by rolling it up before it gets persisted to disk, i.e. I could roll up hundreds of messages per sessionId into a single message.

That's pretty much it, and maybe you're right that it is mixing things, and others might not think it is useful.

On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> I think there is no inherent reason we couldn't include a
> "transformation" plug in that runs before data is written. But after
> some bad experiences I am kind of fundamentally against allowing
> application code into the infrastructure process. Can you flesh out
> the use case a little more with some example? Wouldn't doing a
> post-aggregation and re-publication to another topic work just as
> well?
>
> -Jay
>
> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> > Oh, maybe this isn't possible again since the object is mapped to a file,
> > and it may already have flushed data at the OS level?
> >
> > On Tue, May 15, 2012 at 11:43 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> >
> >> One downside is if my logic was messed up, I don't have a time window for
> >> rolling the logic back (which was one of the benefits of Kafka's design
> >> choice of having messages kept around for x days).
> >>
> >>
> >> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> >>
> >>> What do you mean?
> >>>
> >>> " I think the direction we are going
> >>> is instead to just let you co-locate this processing on the same box.
> >>> This gives the isolation of separate processes and the overhead of the
> >>> transfer over localhost is pretty minor. "
> >>>
> >>>
> >>> I see what you're saying, as it is a specific implementation/use case that
> >>> diverges from a general purpose mechanism; that's why I was suggesting maybe
> >>> a hook/event based system.
> >>>
> >>>
> >>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >>>
> >>>> Yeah I see where you are going with that. We toyed with this idea, but
> >>>> the idea of coupling processing to the log storage raises a lot of
> >>>> problems for general purpose usage. I think the direction we are going
> >>>> is instead to just let you co-locate this processing on the same box.
> >>>> This gives the isolation of separate processes and the overhead of the
> >>>> transfer over localhost is pretty minor.
> >>>>
> >>>> -Jay
> >>>>
> >>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> >>>> > Would it be possible to filter the collection before it gets flushed to
> >>>> > disk?
> >>>> >
> >>>> > Say I am tracking page views per user, and I could perform a rollup
> >>>> > before it gets flushed to disk (using a hashmap with the key being the
> >>>> > sessionId, and increment a counter for the duplicate entries).
> >>>> >
> >>>> > And could this be done w/o modifying the original source, maybe through
> >>>> > some sort of event/listener?
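
For illustration, the rollup described at the top of this message (a hashmap keyed by sessionId, with the timestamp cut down to the minute or hour) could look roughly like the sketch below. It is only a sketch under assumptions: the function name roll_up, the BUCKET_FORMATS table, and the idea that "value" is a number that can be summed are all made up for the example, and nothing here touches Kafka itself. The same function would work whether it ran in a hook before the flush or, as Jay suggests, in a separate consumer that re-publishes the aggregated messages to another topic.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp layout, following the message format in this thread:
# year-month-day-hour-minute-second
TIMESTAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"

# Hypothetical granularity names mapped to the timestamp prefix each keeps.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%d-%H-%M",
    "hour": "%Y-%m-%d-%H",
}


def roll_up(messages, granularity="minute"):
    """Collapse (session_id, timestamp, value) tuples into one per bucket.

    Assumes `value` is numeric and that summing it is the desired
    aggregation (e.g. a page-view count of 1 per message).
    """
    bucket_format = BUCKET_FORMATS[granularity]
    totals = defaultdict(int)  # (session_id, bucket) -> summed value
    for session_id, timestamp, value in messages:
        parsed = datetime.strptime(timestamp, TIMESTAMP_FORMAT)
        bucket = parsed.strftime(bucket_format)  # drop the finer fields
        totals[(session_id, bucket)] += value
    # Sort so the output order is deterministic.
    return [(sid, bucket, total)
            for (sid, bucket), total in sorted(totals.items())]


if __name__ == "__main__":
    sample = [
        ("s1", "2012-05-17-11-02-05", 1),
        ("s1", "2012-05-17-11-02-40", 1),
        ("s1", "2012-05-17-11-03-01", 1),
        ("s2", "2012-05-17-11-02-59", 1),
    ]
    for row in roll_up(sample, "minute"):
        print(row)
```

With the four sample messages above, the minute-level rollup keeps three rows and the hour-level rollup keeps two, which is where the disk saving would come from. The downside mentioned earlier in the thread still applies: once only the rolled-up rows are stored, a bug in the aggregation cannot be undone by replaying the raw messages.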