Re: EventDrivenSource and dead threads

Brock Noland Wed, 16 Jan 2013 23:38:21 -0800

Hi,

Yeah I'd prefer to avoid overriding getLifeCycle to do this since
getters are typically lightweight. As such, I feel it violates the
general contract behind getters.


In regards to channels/sinks/pollable sources, I don't think they need
this behavior as sinks and poll-able sources both have threads polling
them at regular intervals. We have two robust channels memory and file
which don't require this behavior as well. Given this, I feel like
this is limited to EventDrivenSources which don't have a thread
polling them.

On second thought, FLUME-1630 got rid of one layer of threads in the
flume agent, perhaps adding a single threaded scheduled executor to
EventDrivenSource is not a bad solution.

Brock

On Wed, Jan 16, 2013 at 11:02 PM, Connor Woodson <[email protected]> wrote:
> What I was trying to say is that the ScribeSource should not be responsible
> for restarting itself (just from my understanding of your idea; it breaks
> the existing paradigm as components should not have to control their own
> lifecycle). Going from Brock's link, I feel the most dynamic solution would
> be possible to just add in a lifecyle state RESTART and place it in that
> switch statement; when that state is reached, it tries to stop then restart
> the component and then sets the desired state to START (or STOP if it
> couldn't start it again, or set error=true if it couldn't stop it).
>
> And in a way to prevent overriding getLifecycleState to return RESTART,
> there could either be an inherited function from AbstractSource to call for
> a restart, there could maybe be an option to set it to RESTART on crash, or
> something else.
>
> I'll admit though that I know little about the lifecycle system, so I have
> no idea if this idea is any better.
>
> - Connor
>
>
> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
> [email protected]> wrote:
>
>> Sink, Source and Channel all extend LifecycleAware, so the function is
>> available to all components already.
>>
>> I was more questioning whether it was reasonable to start including logic
>> to determine the state. That being said, I think the precedent of just
>> returning the state set by start/stop is more one of habit, so on further
>> thought I don't see it as being unreasonable. I'm going to give fixing
>> ScribeSource with it a poke.
>>
>> As to new lifecycle states, I took a pass at reworking the lifecycle model
>> with guavas service implementation in the past, but it took some very
>> significant changes and didn't get the momentum/interest necessary to keep
>> working on it. That ticket is here https://issues.apache.org/**
>> jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .
>> Brock might be working on it now though the issue doesn't appear to have
>> had attention since october.
>>
>>
>> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>>
>>> Why limit it to the sources? If there is going to be a change to one
>>> component's lifecycle, then I see no reason not to change every
>>> component's
>>> lifecycle.
>>>
>>> Sinks and Channels could very well have this problem; so what about giving
>>> each LifecycleAware component a takePulse method (or something); or to
>>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>>> which, when detected, causes a restart of the component. Then components
>>> would just need to override getLifecycleState (if this method is polled
>>> regularly; I don't know. maybe use a listener for when there's a state
>>> change) to detect if it has crashed/needs to be restarted.
>>>
>>> Just my thoughts,
>>>
>>> - Connor
>>>
>>>
>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>>> [email protected].**jp <[email protected]>>
>>> wrote:
>>>
>>>  Hmm, overriding the implementation of getLifecycleState provided by
>>>> AbstractSource could work. It would be going against the convention that
>>>> has been maintained in all other components(that I can think of)
>>>>
>>>>
>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>>
>>>>  Hi,
>>>>>
>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>>> another thread. An alternative which wouldn't require another thread
>>>>> would be to check the optional interface in the supervisor,
>>>>> approximately here:
>>>>>
>>>>> https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>>> core/src/main/java/org/apache/****flume/lifecycle/**
>>>>> LifecycleSupervisor.java#L240<**https://github.com/apache/**
>>>>> flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**
>>>>> flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>>> >
>>>>>
>>>>>
>>>>> However, I am not sold on the supervisor being the best place to fix
>>>>> this as I am not sure that other lifecycle components would need this.
>>>>>
>>>>>
>>>>>
>>>>> Brock
>>>>>
>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>>> <juhani_connolly@cyberagent.****co.jp <http://co.jp> <
>>>>> juhani_connolly@cyberagent.**co.jp <[email protected]>>>
>>>>>
>>>>> wrote:
>>>>>
>>>>>  I came upon an issue with ScribeSource,  though it's theoretically
>>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>>> die.
>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>>> will
>>>>>> result in it trying to allocate space for some arbitrarily large packet
>>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>>> thought
>>>>>> this would be an issue in avro too, but it throws an exception before
>>>>>> making
>>>>>> excessive allocation requests).
>>>>>>
>>>>>> As far as flume is concerned, the component is still alive. stop() was
>>>>>> never
>>>>>> called, so even monitoring the component state using jmx will not
>>>>>> notice
>>>>>> anything wrong. This situation occurs from user error, but there is
>>>>>> potential for other errors leaving a zombie component. I think it would
>>>>>> be
>>>>>> more user friendly to be able to recover from such errors.
>>>>>>
>>>>>> I'm thinking of adding a StatusPollable interface that
>>>>>> EventDrivenSources
>>>>>> can optionally implement(because we can't change the interface without
>>>>>> a
>>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>>> schedule
>>>>>> a regular poll to check the state. Upon failure it could either call
>>>>>> stop()
>>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>>> restarted by its supervisor.
>>>>>>
>>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>>> issue.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>



-- 
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/

Re: EventDrivenSource and dead threads

Reply via email to