I think there is no inherent reason we couldn't include a "transformation" plug-in that runs before data is written. But after some bad experiences I am kind of fundamentally against allowing application code into the infrastructure process. Can you flesh out the use case a little more with an example? Wouldn't doing a post-aggregation and re-publication to another topic work just as well?
-Jay

On Thu, May 17, 2012 at 6:40 AM, S Ahmed <sahmed1...@gmail.com> wrote:
> Oh, maybe this isn't possible again since the object is mapped to a file,
> and it may already have flushed data at the OS level?
>
> On Tue, May 15, 2012 at 11:43 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>
>> One downside is that if my logic was messed up, I wouldn't have a window
>> in which to roll the logic back (which was one of the benefits of Kafka's
>> design choice of keeping messages around for x days).
>>
>>
>> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>
>>> What do you mean?
>>>
>>> "I think the direction we are going
>>> is instead to just let you co-locate this processing on the same box.
>>> This gives the isolation of separate processes and the overhead of the
>>> transfer over localhost is pretty minor."
>>>
>>>
>>> I see what you're saying, as it is a specific implementation/use case
>>> that diverges from a general-purpose mechanism; that's why I was
>>> suggesting maybe a hook/event-based system.
>>>
>>>
>>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>>>
>>>> Yeah, I see where you are going with that. We toyed with this idea, but
>>>> the idea of coupling processing to the log storage raises a lot of
>>>> problems for general-purpose usage. I think the direction we are going
>>>> is instead to just let you co-locate this processing on the same box.
>>>> This gives the isolation of separate processes, and the overhead of the
>>>> transfer over localhost is pretty minor.
>>>>
>>>> -Jay
>>>>
>>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>>> > Would it be possible to filter the collection before it gets flushed
>>>> > to disk?
>>>> >
>>>> > Say I am tracking page views per user, and I could perform a rollup
>>>> > before it gets flushed to disk (using a hashmap with the key being
>>>> > the sessionId, incrementing a counter for the duplicate entries).
>>>> >
>>>> > And could this be done w/o modifying the original source, maybe
>>>> > through some sort of event/listener?
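For what it's worth, the alternative suggested above (a co-located process that consumes the raw topic, rolls up per session, and republishes to an aggregate topic) might look roughly like the sketch below. This is a minimal, hypothetical illustration, not anything in Kafka itself: the `RollupBuffer` class, topic names, and record shape are all made up, and the actual Kafka consume/produce wiring (shown here only as a comment, using the modern Java client API for illustration) is omitted.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the rollup discussed in this thread: instead of
 * filtering inside the broker before flush, a co-located consumer
 * aggregates page views per session in a hashmap and republishes the
 * counts to a second topic. Only the in-memory aggregation is shown.
 */
public class RollupBuffer {
    // sessionId -> page-view count for the current window
    private final Map<String, Long> counts = new HashMap<>();

    /** Record one page-view event (e.g. one message from the raw topic). */
    public void record(String sessionId) {
        counts.merge(sessionId, 1L, Long::sum);
    }

    /**
     * Drain the current window. In the real pipeline each entry would be
     * published to an aggregate topic, e.g. (modern Java client, for
     * illustration only):
     *   producer.send(new ProducerRecord<>("pageviews-rollup", sessionId, count));
     * with consumer offsets committed only after the send succeeds.
     */
    public Map<String, Long> flush() {
        Map<String, Long> snapshot = new HashMap<>(counts);
        counts.clear();
        return snapshot;
    }

    public static void main(String[] args) {
        RollupBuffer buf = new RollupBuffer();
        // Simulate duplicate entries for the same session, as in the example.
        buf.record("session-a");
        buf.record("session-a");
        buf.record("session-b");
        Map<String, Long> rolled = buf.flush();
        System.out.println(rolled.get("session-a")); // 2
        System.out.println(rolled.get("session-b")); // 1
        System.out.println(buf.flush().isEmpty());   // true: buffer resets
    }
}
```

This keeps the broker untouched and gives you the rollback window the log retention provides: if the rollup logic is wrong, the raw topic still holds the original messages for x days and the aggregates can be recomputed.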