Re: EventDrivenSource and dead threads

Connor Woodson Thu, 17 Jan 2013 00:55:02 -0800

I don't know if this is within the scope of this change, but sinks
sometimes will need to be restarted; for instance, I had an HDFS sink crash
from an Out of Memory error (caused by the JIRA I filed; I wasn't planning
on using a large heap) and it doesn't automatically restart; I don't know
if there's a nice way to detect of a sink/source has crashed, but if so it
would be nice to have them auto-restart when they go down.


- Connor


On Thu, Jan 17, 2013 at 12:27 AM, Juhani Connolly <
[email protected]> wrote:

> What I described isn't really taking control of lifecycle.
>
> What would happen is:
> - source start. State=start
> - OOM exception happens
> - getLifecycleState called, override calls server.isServing(). This is a
> lightweight call. If it returns false, it either calls stop() or local
> lifeCycleState is set to ERROR (I'm not actually sure how this is handled).
> The latter is probably the correct call in this case, but I'm not entirely
> sure if the supervisor will ever restart it then.
>
> The supervisor would still be doing any restarting. I guess it's really a
> matter of who detects the bad state(getLifecycleState or a scheduled
> runnable) and what that call is allowed to do(switch state to ERROR or call
> stop()). When the source breaks, changing its status to accurately reflect
> this is not breaking the paradigm(though calling stop() may be)
>
> It looks to me like switching the state to ERROR should be sufficient...
> Monitor should then try to start the source again. However not calling
> stop() before that may cause resource leakage? Could someone confirm the
> behavior in regards to ERROR?
>
>
> On 01/17/2013 04:02 PM, Connor Woodson wrote:
>
>> What I was trying to say is that the ScribeSource should not be
>> responsible
>> for restarting itself (just from my understanding of your idea; it breaks
>> the existing paradigm as components should not have to control their own
>> lifecycle). Going from Brock's link, I feel the most dynamic solution
>> would
>> be possible to just add in a lifecyle state RESTART and place it in that
>> switch statement; when that state is reached, it tries to stop then
>> restart
>> the component and then sets the desired state to START (or STOP if it
>> couldn't start it again, or set error=true if it couldn't stop it).
>>
>> And in a way to prevent overriding getLifecycleState to return RESTART,
>> there could either be an inherited function from AbstractSource to call
>> for
>> a restart, there could maybe be an option to set it to RESTART on crash,
>> or
>> something else.
>>
>> I'll admit though that I know little about the lifecycle system, so I have
>> no idea if this idea is any better.
>>
>> - Connor
>>
>>
>> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
>> [email protected].**jp <[email protected]>>
>> wrote:
>>
>>  Sink, Source and Channel all extend LifecycleAware, so the function is
>>> available to all components already.
>>>
>>> I was more questioning whether it was reasonable to start including logic
>>> to determine the state. That being said, I think the precedent of just
>>> returning the state set by start/stop is more one of habit, so on further
>>> thought I don't see it as being unreasonable. I'm going to give fixing
>>> ScribeSource with it a poke.
>>>
>>> As to new lifecycle states, I took a pass at reworking the lifecycle
>>> model
>>> with guavas service implementation in the past, but it took some very
>>> significant changes and didn't get the momentum/interest necessary to
>>> keep
>>> working on it. That ticket is here https://issues.apache.org/**
>>> jira/browse/FLUME-966 
>>> <https://issues.apache.org/**jira/browse/FLUME-966<https://issues.apache.org/jira/browse/FLUME-966>>
>>> .
>>>
>>> Brock might be working on it now though the issue doesn't appear to have
>>> had attention since october.
>>>
>>>
>>> On 01/17/2013 03:01 PM, Connor Woodson wrote:
>>>
>>>  Why limit it to the sources? If there is going to be a change to one
>>>> component's lifecycle, then I see no reason not to change every
>>>> component's
>>>> lifecycle.
>>>>
>>>> Sinks and Channels could very well have this problem; so what about
>>>> giving
>>>> each LifecycleAware component a takePulse method (or something); or to
>>>> avoid creating a new method, add a new lifecycle state 'crashed' or such
>>>> which, when detected, causes a restart of the component. Then components
>>>> would just need to override getLifecycleState (if this method is polled
>>>> regularly; I don't know. maybe use a listener for when there's a state
>>>> change) to detect if it has crashed/needs to be restarted.
>>>>
>>>> Just my thoughts,
>>>>
>>>> - Connor
>>>>
>>>>
>>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
>>>> [email protected].****jp <juhani_connolly@cyberagent.**
>>>> co.jp <[email protected]>>>
>>>>
>>>> wrote:
>>>>
>>>>   Hmm, overriding the implementation of getLifecycleState provided by
>>>>
>>>>> AbstractSource could work. It would be going against the convention
>>>>> that
>>>>> has been maintained in all other components(that I can think of)
>>>>>
>>>>>
>>>>> On 01/17/2013 01:20 PM, Brock Noland wrote:
>>>>>
>>>>>   Hi,
>>>>>
>>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet
>>>>>> another thread. An alternative which wouldn't require another thread
>>>>>> would be to check the optional interface in the supervisor,
>>>>>> approximately here:
>>>>>>
>>>>>> https://github.com/apache/******flume/blob/trunk/flume-ng-**<https://github.com/apache/****flume/blob/trunk/flume-ng-**>
>>>>>> <h**ttps://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
>>>>>> >
>>>>>> core/src/main/java/org/apache/******flume/lifecycle/**
>>>>>> LifecycleSupervisor.java#L240<****https://github.com/apache/**
>>>>>>
>>>>>> flume/blob/trunk/flume-ng-****core/src/main/java/org/apache/****
>>>>>> flume/lifecycle/****LifecycleSupervisor.java#L240<**
>>>>>> https://github.com/apache/**flume/blob/trunk/flume-ng-**
>>>>>> core/src/main/java/org/apache/**flume/lifecycle/**
>>>>>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>
>>>>>> >
>>>>>>
>>>>>> However, I am not sold on the supervisor being the best place to fix
>>>>>> this as I am not sure that other lifecycle components would need this.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Brock
>>>>>>
>>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
>>>>>> <juhani_connolly@cyberagent.******co.jp <http://co.jp> <
>>>>>>
>>>>>> juhani_connolly@cyberagent.**c**o.jp <http://co.jp> <
>>>>>> juhani_connolly@cyberagent.**co.jp <[email protected]>
>>>>>> >>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>   I came upon an issue with ScribeSource,  though it's theoretically
>>>>>>
>>>>>>> applicable to any EventDrivenSource whose event generating thread(s)
>>>>>>> die.
>>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port
>>>>>>> will
>>>>>>> result in it trying to allocate space for some arbitrarily large
>>>>>>> packet
>>>>>>> resulting in an OOMException which kills the thread(incidentally I
>>>>>>> thought
>>>>>>> this would be an issue in avro too, but it throws an exception before
>>>>>>> making
>>>>>>> excessive allocation requests).
>>>>>>>
>>>>>>> As far as flume is concerned, the component is still alive. stop()
>>>>>>> was
>>>>>>> never
>>>>>>> called, so even monitoring the component state using jmx will not
>>>>>>> notice
>>>>>>> anything wrong. This situation occurs from user error, but there is
>>>>>>> potential for other errors leaving a zombie component. I think it
>>>>>>> would
>>>>>>> be
>>>>>>> more user friendly to be able to recover from such errors.
>>>>>>>
>>>>>>> I'm thinking of adding a StatusPollable interface that
>>>>>>> EventDrivenSources
>>>>>>> can optionally implement(because we can't change the interface
>>>>>>> without
>>>>>>> a
>>>>>>> version change). If implemented, the EventDrivenSourceRunner would
>>>>>>> schedule
>>>>>>> a regular poll to check the state. Upon failure it could either call
>>>>>>> stop()
>>>>>>> to signal it broke. With autoRestartPolicy, the source would then get
>>>>>>> restarted by its supervisor.
>>>>>>>
>>>>>>> Would appreciate any opinions before I put together a patch/post an
>>>>>>> issue.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>

Re: EventDrivenSource and dead threads

Reply via email to