Satish,

The why question is what was really bugging me, and I was able to get into
the right state of mind to try to tackle this (note to self: pizza + beer ==
fuel for jon to deal with concurrency).

I looked at your notes, Eric's patch and logs, found a trace that generates
the problem, and found what I believe to be the root cause. I have attached a
patch to the jira [1] that is the quicker of the suggested fixes.

The patch seems to fix the bug by avoiding the root cause of the problem,
and it held up in some quick manual tests. The full unit test suite is
running right now, and I still need to write unit tests that would fail
prior to the fix. I'm going to try to post it to the new Apache ReviewBoard
so you can take a look.
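
For a rough sense of the kind of guard involved (this is purely an
illustrative sketch with made-up names, not the actual patch on the jira),
one way to avoid the double start described below is to make the driver
start idempotent:

    // Illustrative sketch only; class and field names are hypothetical.
    public class GuardedNodeDriver {
      private final Object lock = new Object();
      private boolean started = false;

      public void startNodeDriver(Runnable driver) {
        synchronized (lock) {
          if (started) {
            return;  // whichever path gets here second becomes a no-op
          }
          started = true;
        }
        new Thread(driver, "node-driver").start();
      }
    }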

Tomorrow I'll try to chat with Eric to reconcile the different approaches,
merge the ideas from both, and figure out which approach to take for the
"official fix". This is sufficiently complicated that I'll try to document
the functionality in the wiki.

Thanks for your analysis, it really helped a lot!
Jon.

[1] https://issues.apache.org/jira/browse/FLUME-706



On Wed, Aug 10, 2011 at 12:34 PM, Satish Eerpini <[email protected]> wrote:

> Does anybody have ideas/fixes on how to tackle this?
>
> What is the best way to stall this? I am thinking of delaying the
> heartbeat checking mechanism so that it waits for some time before
> going to the nodesman.get() call. Would that help?
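>
> Roughly the kind of delay I am imagining (illustrative only; the five
> second value and the printed message are placeholders, not real Flume
> code):
>
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     public class DelayedHeartbeatCheck {
>       public static void main(String[] args) {
>         ScheduledExecutorService ses =
>             Executors.newSingleThreadScheduledExecutor();
>         // hold off before the first check, then repeat on a fixed delay
>         ses.scheduleWithFixedDelay(new Runnable() {
>           public void run() {
>             // roughly where the real code would call nodesman.get() and
>             // only spawn the logical node if it is still missing
>             System.out.println("heartbeat check would run here");
>           }
>         }, 5, 5, TimeUnit.SECONDS);
>       }
>     }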
>
> I am also curious about why the behavior with the bug is so
> indeterminate: sometimes the second start goes through and restarts
> the ThriftEventSource after forcing the first one to shut down, and
> "sometimes" it lands in a bind error because the first one does not
> seem to have completely stopped listening on the port.
>
> Satish
>
> On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> > Satish:
> >
> > You're 100% dead on. This is a real issue. I fixed this but the code
> > is so critical I've been afraid to commit it. It needs serious review.
> > I'm going to do my best to get the fix up (at least on a branch or
> > something) so folks can pick it apart in the next couple of days. The
> > JIRA tracking this one is
> > https://issues.apache.org/jira/browse/FLUME-706
> >
> > On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
> >> Hello folks,
> >>
> >>
> >> There seems to be a convoluted and critical bug in the control flow
> >> within a logical node. I have been struggling with this issue for the
> >> past three weeks; here is the synopsis:
> >>
> >> Symptoms: using rpcSource starts the thriftEventSource twice,
> >> resulting in weird SocketExceptions on the agent (because the second
> >> instance forces the driver to exit and closes the port on which
> >> thriftEventSource is listening, before bringing it up again).
> >>
> >> physical node : "collector"
> >> logical node : "mycollector" mapped to "collector"
> >>
> >> Starting an rpcSource on mycollector seems to go through the
> >> following logic:
> >>
> >> HeartBeatThread ->
> >> heartbeatChecks() ->
> >> checkLogicalNodes() ->
> >> master.getLogicalNodes(physNode) returns "mycollector" ->
> >> nodesman.get("mycollector") is null in the for loop ->
> >> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
> >> nd.loadNodeDriver(src, snk) ->
> >> startNodeDriver()    ------------------ starts the thriftEventSource ...
> >>
> >> While the heartbeat thread is running, the CheckConfigThread is also
> >> running, which leads to another path where the same driver is started:
> >>
> >> The heartbeats enqueued by HeartBeatThread ->
> >> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
> >> are handled as follows:
> >>
> >> CheckConfigThread ->
> >> dequeueCheckConfig()->
> >> ln.checkConfig(fcd) ->    ------- though checkConfig is synchronized
> >> it does not really make a difference here.
> >> loadConfig(data)->
> >> loadNodeDriver(newSrc, newSnk)->
> >> startNodeDriver()
> >>
> >> As you can see, the two paths above lead to the same driver being
> >> opened twice, so whichever one reaches driver.start() second
> >> force-closes the existing driver and opens itself up again.
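> >>
> >> To make the race concrete, here is a tiny standalone toy (not Flume
> >> code; the port number is arbitrary): two threads race to bind the same
> >> port, the winner holds it, and the loser dies with a BindException,
> >> which is the same pattern as the second start hitting a port the first
> >> ThriftEventSource has not finished closing.
> >>
> >>   import java.net.ServerSocket;
> >>
> >>   public class DoubleStartRace {
> >>     public static void main(String[] args) {
> >>       Runnable startDriver = new Runnable() {
> >>         public void run() {
> >>           try {
> >>             ServerSocket ss = new ServerSocket(35872);  // arbitrary port
> >>             System.out.println(Thread.currentThread().getName() + " bound");
> >>             Thread.sleep(2000);                         // pretend to serve
> >>             ss.close();
> >>           } catch (Exception e) {
> >>             // the loser of the race ends up here with a BindException
> >>             System.out.println(Thread.currentThread().getName() + ": " + e);
> >>           }
> >>         }
> >>       };
> >>       new Thread(startDriver, "heartbeat-path").start();
> >>       new Thread(startDriver, "checkconfig-path").start();
> >>     }
> >>   }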
> >>
> >> Another symptom is that when this happens, the heartbeats get backed
> >> up, which I believe is because the dequeueCheckConfig() call that
> >> follows the second path above has to wait for a timeout in
> >> driver.start() when it tries to close the existing driver.
> >>
> >> Please let me know if I have missed anything; any help will be greatly
> >> appreciated.
> >>
> >> Thanks
> >> Satish
> >> --
> >> http://satisheerpini.net
> >>
> >
> >
> >
> > --
> > Eric Sammer
> > twitter: esammer
> > data: www.cloudera.com
> >
>
>
>
> --
> http://satisheerpini.net
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
