Here's the plan. Eric, Satish, and I each found different parts of the problem. Satish and I went down similar approaches, and I have a rough draft of a local fix for the immediate problem. Eric and I chatted and we are convinced that this is a reasonable fix for now. We'll move forward on trunk with the local fix and the evolutionary approach, and after that we'll attempt a local refactor while trying to keep mainline backwards compatible.
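To give a feel for the shape of the draft fix, here is a minimal sketch. The names are made up and this is not the actual patch; it just shows the idea of making the per-node driver start idempotent so that whichever path gets there first wins:

// Hypothetical sketch only, not the FLUME-706 patch. Idea: make driver
// startup atomic per logical node, so the heartbeat (spawn) path and
// the check-config path cannot both start a fresh driver.
import java.util.concurrent.atomic.AtomicBoolean;

public class LogicalNodeDriverGuard {
  private final AtomicBoolean driverStarted = new AtomicBoolean(false);

  /** Runs startDriver at most once; returns true if this call did the start. */
  public boolean startNodeDriverOnce(Runnable startDriver) {
    // compareAndSet makes the check-then-act atomic: exactly one caller
    // flips the flag from false to true and actually starts the driver.
    if (driverStarted.compareAndSet(false, true)) {
      startDriver.run();
      return true;
    }
    return false; // the other path won the race; nothing to do here
  }
}

A real fix also has to handle reconfiguration, i.e. stopping the old driver cleanly before a new one rebinds the port, which is where the bind errors come from.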
Eric is experimenting with a larger, more fundamental refactor that could simplify the lifecycle throughout many components of the system. This is being explored in the flume-728 branch and may get merged in the future. For reference, a minimal standalone sketch of the check-then-act race from Satish's trace is appended at the bottom of this thread.

Jon.

On Thu, Aug 11, 2011 at 11:42 PM, Ralph Goers <[email protected]> wrote:

> While it is OK to talk about the problem off list, please post the result
> of your conversation back here. Otherwise the private discussion didn't
> really happen.
>
> Ralph
>
> On Aug 11, 2011, at 2:25 AM, Jonathan Hsieh wrote:
>
> > Satish,
> >
> > The "why" question is what was really bugging me, and I was able to get
> > into the right state of mind to try to tackle this (note to self, pizza
> > + beer == fuel for jon to deal with concurrency).
> >
> > I looked at your notes and at Eric's patch and logs, found a trace that
> > generates the problem, and found what I believe to be the root cause. I
> > have attached a patch to the jira [1] that is the quicker of the
> > suggested fixes.
> >
> > The patch seems to fix the bug by avoiding the root cause of the
> > problem. It seemed reliable in some quick manual tests. Unit tests on
> > the full suite are running right now, and I still need to write unit
> > tests that would fail prior to the fix. I'm going to try to post it to
> > the new Apache reviewboard so you can take a look.
> >
> > Tomorrow, I'll try to chat with Eric to reconcile the different
> > approaches, figure out a plan to merge the ideas from both, and decide
> > which approach to take for the "official fix". This is sufficiently
> > complicated that I'll try to document the functionality in the wiki.
> >
> > Thanks for your analysis, it really helped a lot!
> > Jon.
> >
> > On Wed, Aug 10, 2011 at 12:34 PM, Satish Eerpini <[email protected]> wrote:
> >
> >> Does anybody have ideas/fixes on how to tackle this?
> >>
> >> What is the best way to stall this? I am thinking of delaying the
> >> heartbeat checking mechanism so that it waits for some time before
> >> going to the nodesman.get() call; would that help?
> >>
> >> I am also curious about why the behavior with the bug is so
> >> indeterminate: sometimes the second start goes through and restarts
> >> the ThriftEventSource after forcing the first one to shut down, and
> >> sometimes it lands in a bind error since the first one does not seem
> >> to have completely stopped listening on the port.
> >>
> >> Satish
> >>
> >> On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> >>
> >>> Satish:
> >>>
> >>> You're 100% dead on. This is a real issue. I fixed this but the code
> >>> is so critical I've been afraid to commit it. It needs serious
> >>> review. I'm going to do my best to get the fix up (at least on a
> >>> branch or something) so folks can pick it apart in the next couple
> >>> of days.
> >>> The JIRA tracking this one is
> >>> https://issues.apache.org/jira/browse/FLUME-706
> >>>
> >>> On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
> >>>
> >>>> Hello folks,
> >>>>
> >>>> There seems to be a convoluted and critical bug in the control flow
> >>>> within a logical node. I have been struggling with this issue for
> >>>> the past three weeks; here is the synopsis:
> >>>>
> >>>> Symptoms: using rpcSource starts the thriftEventSource twice,
> >>>> resulting in weird SocketExceptions on the agent (because the second
> >>>> instance forces the driver to exit and closes the port on which the
> >>>> thriftEventSource is listening, before bringing it up again).
> >>>>
> >>>> physical node: "collector"
> >>>> logical node: "mycollector" mapped to "collector"
> >>>>
> >>>> Starting an rpcSource on mycollector seems to go through the
> >>>> following logic:
> >>>>
> >>>> HeartBeatThread ->
> >>>>   heartbeatChecks() ->
> >>>>     checkLogicalNodes() ->
> >>>>       master.getLogicalNodes(physNode) returns "mycollector" ->
> >>>>       nodesman.get("mycollector") is null in the for loop ->
> >>>>       nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
> >>>>         nd.loadNodeDriver(src, snk) ->
> >>>>           startNodeDriver()    <-- starts the thriftEventSource ...
> >>>>
> >>>> While the heartbeat thread is running, the CheckConfigThread is also
> >>>> running, which leads to another path where the same driver is
> >>>> started. The heartbeats which are enqueued by HeartBeatThread ->
> >>>> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
> >>>> are handled as follows:
> >>>>
> >>>> CheckConfigThread ->
> >>>>   dequeueCheckConfig() ->
> >>>>     ln.checkConfig(fcd) ->    <-- though checkConfig is synchronized,
> >>>>                                   it does not really make a difference here
> >>>>       loadConfig(data) ->
> >>>>         loadNodeDriver(newSrc, newSnk) ->
> >>>>           startNodeDriver()
> >>>>
> >>>> As you can see, the above two paths lead to the same driver being
> >>>> opened twice, so the second one that reaches driver.start()
> >>>> force-closes the existing driver and opens itself up again.
> >>>>
> >>>> Another symptom is that when this happens the heartbeats get backed
> >>>> up, which I believe is because the dequeueCheckConfig() that follows
> >>>> the second path above has to wait for a timeout in driver.start()
> >>>> when it tries to close the existing driver.
> >>>>
> >>>> Please let me know if I have missed anything; any help will be
> >>>> greatly appreciated.
> >>>>
> >>>> Thanks
> >>>> Satish
> >>>> --
> >>>> http://satisheerpini.net
> >>>
> >>> --
> >>> Eric Sammer
> >>> twitter: esammer
> >>> data: www.cloudera.com
> >>
> >> --
> >> http://satisheerpini.net
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // [email protected]

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
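Appendix (the sketch mentioned at the top): a minimal standalone illustration of the check-then-act race in Satish's trace, with made-up names rather than the real Flume classes. Both the "heartbeat" and "check-config" threads can pass the null check before either assigns the field, so two drivers get started and the second forces the first off its port.

// Hypothetical illustration only, not Flume code.
public class DoubleStartRace {
  private volatile Thread driver; // stands in for the node driver

  // Called from both paths; the check and the act are not atomic.
  void startNodeDriver() {
    if (driver == null) {            // check...
      driver = new Thread(() -> {
        try {
          Thread.sleep(1000);        // pretend to serve traffic for a while
        } catch (InterruptedException ignored) {
        }
      });
      driver.start();                // ...then act
    }
  }

  public static void main(String[] args) {
    DoubleStartRace node = new DoubleStartRace();
    new Thread(node::startNodeDriver).start(); // HeartBeatThread path
    new Thread(node::startNodeDriver).start(); // CheckConfigThread path
    // With unlucky timing both callers pass the null check and two
    // "drivers" run at once; an atomic guard like the sketch near the
    // top of this thread closes the window.
  }
}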
