Sink Behavior Standardization

Arvind Prabhakar Mon, 27 Feb 2012 18:52:43 -0800

On Mon, Feb 27, 2012 at 6:33 PM, Brock Noland <[email protected]> wrote:


> Hi,
>
> Thank you everyone for you comments! My comments are inline.
>
> On Tue, Feb 28, 2012 at 1:13 AM, Arvind Prabhakar <[email protected]>
> wrote:
> > My bad for not chiming on this thread soon enough. When we laid out
> > the initial architecture, the following assumptions were made and I
> > still think that most of them are valid:
> >
> > 1. Sources doing put() on channel should relay back any exceptions
> > they receive from the channel. They should not die or become invalid
> > due to this. If they do, it is more of a bug in source implementation.
> >
> > 2. Channels must respect capacity. This is vital for operators to
> > ensure that they can size a system without overwhelming it. Both mem
> > and jdbc channels support size specification at this time.
>
> yes with a recent commit it looks like MemoryChannel was changed to
> work like JDBCChannel and throws an exception if full.
>
> >
> > 3. Channels should never block. This is to ensure that there is no
> > scope of threads deadlocking within the agent due to bugs or invalid
> > state of the system. The chosen alternative to blocking was the notion
> > of the sink runner which will honor backoff strategy when necessary.
> > Consequently the implementation of sink should send the correct signal
> > to the runner in case it is not able to take events from the channel
> > or deliver events to the downstream destination.
>
> MemoryChannel.take blocks:
>
>      if(!queueStored.tryAcquire(keepAlive, TimeUnit.SECONDS)) {
>

This is perhaps needed to fine tune the memory channel throughput in a
highly concurrent system. For example, if multiple threads are
reading/writing to the channel, it is possible that the channel may not get
handle on the necessary locks in every request - hence a basic timeout
mechanism would help in ensuring reliability of the channel. I do think
though that the default of 3 seconds is a bit longish and should instead be
more like 100ms or something.


>
> JDBCChannel does not. Meaning that JDBCChannel + HDFSEventSink
> consumes an entire CPU when no events are flowing through the system.
> It sounds like HDFSEventSink should be returning BACKOFF when the
> channel returns null?
>

Correct. The idea is that any logic common to flow
handling/throttling should be implemented by the runner and not by the
individual sink or channel. That will likely not be as sophisticated as
what one can do within a specialized implementation, but certainly serves
the purpose and keeps things uniform and simple.


>
> My main concern with this strategy is that SinkRunner is going to
> sleep for some number of seconds regardless of events being pushed to
> the channel. Users that have very spiky flows have two options:
>
> 1) Turn the sleep down considerably which burns unneeded CPU
> 2) Size the channel large enough to handle the largest spike during
> the specified sleep time
>
> Shouldn't sink runner be event driven and start processing events as
> soon as they arrive on the channel?
>

This does make a lot of sense. However, one of the immediate trade offs of
the flume architecture was to specifically sacrifice blocking behavior in
favor of implementing simple semantics. It is very possible that as flume
gets adopted in the field we realize that this is more of a pressing need
than what we expected it to ben then it will surely get prioritized.

Thanks,
Arvind


>
> Cheers!
> Brock
>
> >
> > At some point in time, when we have the basic implementation of Flume
> > working in production to validate all of these semantics, we can start
> > discussion on how best these semantics can change to accommodate any
> > new findings that we discover in the field.
> >
> > Thanks,
> > Arvind Prabhakar
> >
> > On Mon, Feb 27, 2012 at 11:30 AM, Prasad Mujumdar <[email protected]>
> wrote:
> >>  IMO the blocking vs wait time should be an attribute of the flow and
> not
> >> individual component. Perhaps each source/sink/channel should make it
> >> configurable (with consistent default) so that it it can be tweaked per
> the
> >> use case. The common attributes like timeout, capacity can be standard
> >> configurations that each component should support wherever possible.
> >>
> >> @Brock, I will try to include the relevant conclusions of this
> discussion
> >> in the dev guide.
> >>
> >> thanks
> >> Prasad
> >>
> >>
> >> On Mon, Feb 27, 2012 at 7:35 AM, Peter Newcomb <[email protected]
> >wrote:
> >>
> >>> Juhani, FWIW I agree with most of what you described, based on my
> reading
> >>> and use of the codebase.  Brock, I agree that these things are not yet
> >>> adequately documented--especially in terms of Javadocs for the main
> >>> interfaces: Source, Channel, and Sink.  Also, there is enough variation
> >>> among the various implementations of these interfaces to lead to
> ambiguous
> >>> interpretation.
> >>>
> >>> One thing I wanted to comment on specifically is Juhani's statement
> about
> >>> channel capacity:
> >>>
> >>> > Channels:
> >>> > - Only memory channels have a capacity, but when that is exceeded
> >>> > ChannelException seems a clearcut reaction
> >>>
> >>> Before your recent refactoring of MemoryChannel, put() would block
> >>> indefinitely if the queue was at capacity--are you suggesting that
> this was
> >>> incorrect behavior that should not be allowed?  Or just that any such
> >>> blocking should have a finite duration (similar to take() keep-alive),
> and
> >>> throw ChannelException upon timeout?
> >>>
> >>> Also, other channels may well have implicit capacities, for instance
> >>> available space in a database or filesystem partition, though I agree
> that
> >>> ChannelException would be appropriate in those cases.
> >>>
> >>> -peter
> >>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce -
> http://incubator.apache.org/mrunit/
>

Re: Source/Channel/Sink Behavior Standardization

Reply via email to