I don't know if this is within the scope of this change, but sinks sometimes will need to be restarted; for instance, I had an HDFS sink crash from an Out of Memory error (caused by the JIRA I filed; I wasn't planning on using a large heap) and it doesn't automatically restart; I don't know if there's a nice way to detect of a sink/source has crashed, but if so it would be nice to have them auto-restart when they go down.
- Connor On Thu, Jan 17, 2013 at 12:27 AM, Juhani Connolly < [email protected]> wrote: > What I described isn't really taking control of lifecycle. > > What would happen is: > - source start. State=start > - OOM exception happens > - getLifecycleState called, override calls server.isServing(). This is a > lightweight call. If it returns false, it either calls stop() or local > lifeCycleState is set to ERROR (I'm not actually sure how this is handled). > The latter is probably the correct call in this case, but I'm not entirely > sure if the supervisor will ever restart it then. > > The supervisor would still be doing any restarting. I guess it's really a > matter of who detects the bad state(getLifecycleState or a scheduled > runnable) and what that call is allowed to do(switch state to ERROR or call > stop()). When the source breaks, changing its status to accurately reflect > this is not breaking the paradigm(though calling stop() may be) > > It looks to me like switching the state to ERROR should be sufficient... > Monitor should then try to start the source again. However not calling > stop() before that may cause resource leakage? Could someone confirm the > behavior in regards to ERROR? > > > On 01/17/2013 04:02 PM, Connor Woodson wrote: > >> What I was trying to say is that the ScribeSource should not be >> responsible >> for restarting itself (just from my understanding of your idea; it breaks >> the existing paradigm as components should not have to control their own >> lifecycle). Going from Brock's link, I feel the most dynamic solution >> would >> be possible to just add in a lifecyle state RESTART and place it in that >> switch statement; when that state is reached, it tries to stop then >> restart >> the component and then sets the desired state to START (or STOP if it >> couldn't start it again, or set error=true if it couldn't stop it). >> >> And in a way to prevent overriding getLifecycleState to return RESTART, >> there could either be an inherited function from AbstractSource to call >> for >> a restart, there could maybe be an option to set it to RESTART on crash, >> or >> something else. >> >> I'll admit though that I know little about the lifecycle system, so I have >> no idea if this idea is any better. >> >> - Connor >> >> >> On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly < >> [email protected].**jp <[email protected]>> >> wrote: >> >> Sink, Source and Channel all extend LifecycleAware, so the function is >>> available to all components already. >>> >>> I was more questioning whether it was reasonable to start including logic >>> to determine the state. That being said, I think the precedent of just >>> returning the state set by start/stop is more one of habit, so on further >>> thought I don't see it as being unreasonable. I'm going to give fixing >>> ScribeSource with it a poke. >>> >>> As to new lifecycle states, I took a pass at reworking the lifecycle >>> model >>> with guavas service implementation in the past, but it took some very >>> significant changes and didn't get the momentum/interest necessary to >>> keep >>> working on it. That ticket is here https://issues.apache.org/** >>> jira/browse/FLUME-966 >>> <https://issues.apache.org/**jira/browse/FLUME-966<https://issues.apache.org/jira/browse/FLUME-966>> >>> . >>> >>> Brock might be working on it now though the issue doesn't appear to have >>> had attention since october. >>> >>> >>> On 01/17/2013 03:01 PM, Connor Woodson wrote: >>> >>> Why limit it to the sources? If there is going to be a change to one >>>> component's lifecycle, then I see no reason not to change every >>>> component's >>>> lifecycle. >>>> >>>> Sinks and Channels could very well have this problem; so what about >>>> giving >>>> each LifecycleAware component a takePulse method (or something); or to >>>> avoid creating a new method, add a new lifecycle state 'crashed' or such >>>> which, when detected, causes a restart of the component. Then components >>>> would just need to override getLifecycleState (if this method is polled >>>> regularly; I don't know. maybe use a listener for when there's a state >>>> change) to detect if it has crashed/needs to be restarted. >>>> >>>> Just my thoughts, >>>> >>>> - Connor >>>> >>>> >>>> On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly < >>>> [email protected].****jp <juhani_connolly@cyberagent.** >>>> co.jp <[email protected]>>> >>>> >>>> wrote: >>>> >>>> Hmm, overriding the implementation of getLifecycleState provided by >>>> >>>>> AbstractSource could work. It would be going against the convention >>>>> that >>>>> has been maintained in all other components(that I can think of) >>>>> >>>>> >>>>> On 01/17/2013 01:20 PM, Brock Noland wrote: >>>>> >>>>> Hi, >>>>> >>>>>> Yes I can definitely see the issue. It sucks that we'd have to add yet >>>>>> another thread. An alternative which wouldn't require another thread >>>>>> would be to check the optional interface in the supervisor, >>>>>> approximately here: >>>>>> >>>>>> https://github.com/apache/******flume/blob/trunk/flume-ng-**<https://github.com/apache/****flume/blob/trunk/flume-ng-**> >>>>>> <h**ttps://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**> >>>>>> > >>>>>> core/src/main/java/org/apache/******flume/lifecycle/** >>>>>> LifecycleSupervisor.java#L240<****https://github.com/apache/** >>>>>> >>>>>> flume/blob/trunk/flume-ng-****core/src/main/java/org/apache/**** >>>>>> flume/lifecycle/****LifecycleSupervisor.java#L240<** >>>>>> https://github.com/apache/**flume/blob/trunk/flume-ng-** >>>>>> core/src/main/java/org/apache/**flume/lifecycle/** >>>>>> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240> >>>>>> > >>>>>> >>>>>> However, I am not sold on the supervisor being the best place to fix >>>>>> this as I am not sure that other lifecycle components would need this. >>>>>> >>>>>> >>>>>> >>>>>> Brock >>>>>> >>>>>> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly >>>>>> <juhani_connolly@cyberagent.******co.jp <http://co.jp> < >>>>>> >>>>>> juhani_connolly@cyberagent.**c**o.jp <http://co.jp> < >>>>>> juhani_connolly@cyberagent.**co.jp <[email protected]> >>>>>> >>> >>>>>> >>>>>> wrote: >>>>>> >>>>>> I came upon an issue with ScribeSource, though it's theoretically >>>>>> >>>>>>> applicable to any EventDrivenSource whose event generating thread(s) >>>>>>> die. >>>>>>> Simple put, sending a bad packet to the thrift(scribe protocol) port >>>>>>> will >>>>>>> result in it trying to allocate space for some arbitrarily large >>>>>>> packet >>>>>>> resulting in an OOMException which kills the thread(incidentally I >>>>>>> thought >>>>>>> this would be an issue in avro too, but it throws an exception before >>>>>>> making >>>>>>> excessive allocation requests). >>>>>>> >>>>>>> As far as flume is concerned, the component is still alive. stop() >>>>>>> was >>>>>>> never >>>>>>> called, so even monitoring the component state using jmx will not >>>>>>> notice >>>>>>> anything wrong. This situation occurs from user error, but there is >>>>>>> potential for other errors leaving a zombie component. I think it >>>>>>> would >>>>>>> be >>>>>>> more user friendly to be able to recover from such errors. >>>>>>> >>>>>>> I'm thinking of adding a StatusPollable interface that >>>>>>> EventDrivenSources >>>>>>> can optionally implement(because we can't change the interface >>>>>>> without >>>>>>> a >>>>>>> version change). If implemented, the EventDrivenSourceRunner would >>>>>>> schedule >>>>>>> a regular poll to check the state. Upon failure it could either call >>>>>>> stop() >>>>>>> to signal it broke. With autoRestartPolicy, the source would then get >>>>>>> restarted by its supervisor. >>>>>>> >>>>>>> Would appreciate any opinions before I put together a patch/post an >>>>>>> issue. >>>>>>> >>>>>>> >>>>>>> >>>>>> >
