Yeah, so our current recommendation would be to do that as post-processing in a consumer. It can store its results back to another topic if needed. This gives a clean separation between the log of incoming data and the aggregation process. If you co-locate these things (same machine, different process), the overhead should be pretty small.
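To make that concrete, here is a minimal sketch of the consumer-side rollup being discussed. This is illustrative pseudologic only, not the Kafka client API: it just shows aggregating (sessionID, timestamp, value) messages in a hashmap keyed by session and truncated timestamp, which is the rollup described further down in the thread. The function name and message shape are assumptions for the example.

```python
# Illustrative consumer-side rollup -- not the Kafka client API.
# Messages are assumed to be (session_id, timestamp, value) tuples,
# with timestamps formatted "year-month-day-hour-minute-second".
from collections import defaultdict

def rollup(messages, granularity="minute"):
    # Number of dash-separated timestamp fields to keep per granularity.
    keep = {"hour": 4, "minute": 5}[granularity]
    counters = defaultdict(int)
    for session_id, ts, value in messages:
        # Truncate the timestamp to the requested bucket and sum values,
        # collapsing many raw messages per session into one counter.
        bucket = "-".join(ts.split("-")[:keep])
        counters[(session_id, bucket)] += value
    return dict(counters)

msgs = [
    ("s1", "2012-05-17-14-32-01", 1),
    ("s1", "2012-05-17-14-32-45", 1),
    ("s1", "2012-05-17-14-33-10", 1),
]
print(rollup(msgs, "minute"))
# -> {('s1', '2012-05-17-14-32'): 2, ('s1', '2012-05-17-14-33'): 1}
```

A process like this would consume from the raw topic and republish the rolled-up counters to a second topic, keeping the broker free of application code.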
-Jay

On Thu, May 17, 2012 at 2:32 PM, S Ahmed <sahmed1...@gmail.com> wrote:
> Say I am storing messages like this:
>
> sessionID, year-month-day-hour-minute-second, value
>
> Now say I only need stats at the minute level, or hour level. This means
> that I could save a lot of hard drive space by rolling it up before it gets
> persisted to disk.
>
> i.e. I could roll up hundreds of messages per sessionId into a single message.
>
> That's pretty much it, and maybe you're right that it is mixing things and
> others might not think it is useful.
>
> On Thu, May 17, 2012 at 11:02 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>> I think there is no inherent reason we couldn't include a
>> "transformation" plug-in that runs before data is written. But after
>> some bad experiences I am kind of fundamentally against allowing
>> application code into the infrastructure process. Can you flesh out
>> the use case a little more with some examples? Wouldn't doing a
>> post-aggregation and re-publication to another topic work just as
>> well?
>>
>> -Jay
>>
>> On Thu, May 17, 2012 at 6:40 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>> Oh, maybe this isn't possible again, since the object is mapped to a file,
>>> and it may already have flushed data at the OS level?
>>>
>>> On Tue, May 15, 2012 at 11:43 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>>> One downside is that if my logic was messed up, I don't have a timeframe
>>>> for rolling the logic back (which was one of the benefits of Kafka's
>>>> design choice of keeping messages around for x days).
>>>>
>>>> On Tue, May 15, 2012 at 11:42 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>>>> What do you mean?
>>>>>
>>>>> "I think the direction we are going
>>>>> is instead to just let you co-locate this processing on the same box.
>>>>> This gives the isolation of separate processes and the overhead of the
>>>>> transfer over localhost is pretty minor."
>>>>>
>>>>> I see what you're saying: it is a specific implementation/use case that
>>>>> diverts from a general-purpose mechanism. That's why I was suggesting
>>>>> maybe a hook/event-based system.
>>>>>
>>>>> On Tue, May 15, 2012 at 11:24 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
>>>>>> Yeah, I see where you are going with that. We toyed with this idea, but
>>>>>> the idea of coupling processing to the log storage raises a lot of
>>>>>> problems for general-purpose usage. I think the direction we are going
>>>>>> is instead to just let you co-locate this processing on the same box.
>>>>>> This gives the isolation of separate processes, and the overhead of the
>>>>>> transfer over localhost is pretty minor.
>>>>>>
>>>>>> -Jay
>>>>>>
>>>>>> On Tue, May 15, 2012 at 6:38 AM, S Ahmed <sahmed1...@gmail.com> wrote:
>>>>>>> Would it be possible to filter the collection before it gets flushed
>>>>>>> to disk?
>>>>>>>
>>>>>>> Say I am tracking page views per user, and I could perform a rollup
>>>>>>> before it gets flushed to disk (using a hashmap with the key being the
>>>>>>> sessionId, and incrementing a counter for the duplicate entries).
>>>>>>>
>>>>>>> And could this be done w/o modifying the original source, maybe through
>>>>>>> some sort of event/listener?