What I described isn't really taking control of lifecycle.

What would happen is:
- source start. State=start
- OOM exception happens
- getLifecycleState called, override calls server.isServing(). This is a lightweight call. If it returns false, it either calls stop() or local lifeCycleState is set to ERROR (I'm not actually sure how this is handled). The latter is probably the correct call in this case, but I'm not entirely sure if the supervisor will ever restart it then.

The supervisor would still be doing any restarting. I guess it's really a matter of who detects the bad state(getLifecycleState or a scheduled runnable) and what that call is allowed to do(switch state to ERROR or call stop()). When the source breaks, changing its status to accurately reflect this is not breaking the paradigm(though calling stop() may be)

It looks to me like switching the state to ERROR should be sufficient... Monitor should then try to start the source again. However not calling stop() before that may cause resource leakage? Could someone confirm the behavior in regards to ERROR?

On 01/17/2013 04:02 PM, Connor Woodson wrote:
What I was trying to say is that the ScribeSource should not be responsible
for restarting itself (just from my understanding of your idea; it breaks
the existing paradigm as components should not have to control their own
lifecycle). Going from Brock's link, I feel the most dynamic solution would
be possible to just add in a lifecyle state RESTART and place it in that
switch statement; when that state is reached, it tries to stop then restart
the component and then sets the desired state to START (or STOP if it
couldn't start it again, or set error=true if it couldn't stop it).

And in a way to prevent overriding getLifecycleState to return RESTART,
there could either be an inherited function from AbstractSource to call for
a restart, there could maybe be an option to set it to RESTART on crash, or
something else.

I'll admit though that I know little about the lifecycle system, so I have
no idea if this idea is any better.

- Connor


On Wed, Jan 16, 2013 at 10:21 PM, Juhani Connolly <
[email protected]> wrote:

Sink, Source and Channel all extend LifecycleAware, so the function is
available to all components already.

I was more questioning whether it was reasonable to start including logic
to determine the state. That being said, I think the precedent of just
returning the state set by start/stop is more one of habit, so on further
thought I don't see it as being unreasonable. I'm going to give fixing
ScribeSource with it a poke.

As to new lifecycle states, I took a pass at reworking the lifecycle model
with guavas service implementation in the past, but it took some very
significant changes and didn't get the momentum/interest necessary to keep
working on it. That ticket is here https://issues.apache.org/**
jira/browse/FLUME-966 <https://issues.apache.org/jira/browse/FLUME-966> .
Brock might be working on it now though the issue doesn't appear to have
had attention since october.


On 01/17/2013 03:01 PM, Connor Woodson wrote:

Why limit it to the sources? If there is going to be a change to one
component's lifecycle, then I see no reason not to change every
component's
lifecycle.

Sinks and Channels could very well have this problem; so what about giving
each LifecycleAware component a takePulse method (or something); or to
avoid creating a new method, add a new lifecycle state 'crashed' or such
which, when detected, causes a restart of the component. Then components
would just need to override getLifecycleState (if this method is polled
regularly; I don't know. maybe use a listener for when there's a state
change) to detect if it has crashed/needs to be restarted.

Just my thoughts,

- Connor


On Wed, Jan 16, 2013 at 9:08 PM, Juhani Connolly <
[email protected].**jp <[email protected]>>
wrote:

  Hmm, overriding the implementation of getLifecycleState provided by
AbstractSource could work. It would be going against the convention that
has been maintained in all other components(that I can think of)


On 01/17/2013 01:20 PM, Brock Noland wrote:

  Hi,
Yes I can definitely see the issue. It sucks that we'd have to add yet
another thread. An alternative which wouldn't require another thread
would be to check the optional interface in the supervisor,
approximately here:

https://github.com/apache/****flume/blob/trunk/flume-ng-**<https://github.com/apache/**flume/blob/trunk/flume-ng-**>
core/src/main/java/org/apache/****flume/lifecycle/**
LifecycleSupervisor.java#L240<**https://github.com/apache/**
flume/blob/trunk/flume-ng-**core/src/main/java/org/apache/**
flume/lifecycle/**LifecycleSupervisor.java#L240<https://github.com/apache/flume/blob/trunk/flume-ng-core/src/main/java/org/apache/flume/lifecycle/LifecycleSupervisor.java#L240>

However, I am not sold on the supervisor being the best place to fix
this as I am not sure that other lifecycle components would need this.



Brock

On Wed, Jan 16, 2013 at 7:45 PM, Juhani Connolly
<juhani_connolly@cyberagent.****co.jp <http://co.jp> <
juhani_connolly@cyberagent.**co.jp <[email protected]>>>

wrote:

  I came upon an issue with ScribeSource,  though it's theoretically
applicable to any EventDrivenSource whose event generating thread(s)
die.
Simple put, sending a bad packet to the thrift(scribe protocol) port
will
result in it trying to allocate space for some arbitrarily large packet
resulting in an OOMException which kills the thread(incidentally I
thought
this would be an issue in avro too, but it throws an exception before
making
excessive allocation requests).

As far as flume is concerned, the component is still alive. stop() was
never
called, so even monitoring the component state using jmx will not
notice
anything wrong. This situation occurs from user error, but there is
potential for other errors leaving a zombie component. I think it would
be
more user friendly to be able to recover from such errors.

I'm thinking of adding a StatusPollable interface that
EventDrivenSources
can optionally implement(because we can't change the interface without
a
version change). If implemented, the EventDrivenSourceRunner would
schedule
a regular poll to check the state. Upon failure it could either call
stop()
to signal it broke. With autoRestartPolicy, the source would then get
restarted by its supervisor.

Would appreciate any opinions before I put together a patch/post an
issue.




Reply via email to