Why limit it to the sources? If there is going to be a change to one component's lifecycle, then I see no reason not to change every component's lifecycle.
Sinks and Channels could very well have this problem; so what about giving each LifecycleAware component a takePulse method (or something); or to avoid creating a new method, add a new lifecycle state 'crashed' or such which, when detected, causes a restart of the component. Then components would just need to override getLifecycleState (if this method is polled regularly; I don't know. maybe use a listener for when there's a state change) to detect if it has crashed/needs to be restarted. Just my thoughts, - Connor On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly < [email protected]> wrote: > Hmm, overriding the implementation of getLifecycleState provided by > AbstractSource could work. It would be going against the convention that > has been maintained in all other components(that I can think of) > > > On 01/17/2013 01:20 PM, Brock Noland wrote: > >> Hi, >> >> Yes I can definitely see the issue. It sucks that we'd have to add yet >> another thread. An alternative which wouldn't require another thread >> would be to check the optional interface in the supervisor, >> approximately here: >> >> https://github.com/apache/**flume/blob/trunk/flume-ng-** >> core/src/main/java/org/apache/**flume/lifecycle/** >> LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240> >> >> However, I am not sold on the supervisor being the best place to fix >> this as I am not sure that other lifecycle components would need this. >> >> >> >> Brock >> >> On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly >> <juhani_connolly@cyberagent.**co.jp <[email protected]>> >> wrote: >> >>> I came upon an issue with ScribeSource, though it's theoretically >>> applicable to any EventDrivenSource whose event generating thread(s) die. >>> Simple put, sending a bad packet to the thrift(scribe protocol) port will >>> result in it trying to allocate space for some arbitrarily large packet >>> resulting in an OOMException which kills the thread(incidentally I >>> thought >>> this would be an issue in avro too, but it throws an exception before >>> making >>> excessive allocation requests). >>> >>> As far as flume is concerned, the component is still alive. stop() was >>> never >>> called, so even monitoring the component state using jmx will not notice >>> anything wrong. This situation occurs from user error, but there is >>> potential for other errors leaving a zombie component. I think it would >>> be >>> more user friendly to be able to recover from such errors. >>> >>> I'm thinking of adding a StatusPollable interface that EventDrivenSources >>> can optionally implement(because we can't change the interface without a >>> version change). If implemented, the EventDrivenSourceRunner would >>> schedule >>> a regular poll to check the state. Upon failure it could either call >>> stop() >>> to signal it broke. With autoRestartPolicy, the source would then get >>> restarted by its supervisor. >>> >>> Would appreciate any opinions before I put together a patch/post an >>> issue. >>> >> >> >> >
