Re: Critical Bug in Flume LogicalNode core logic

Jonathan Hsieh Thu, 11 Aug 2011 02:27:15 -0700

Forgot link:
[1] https://issues.apache.org/jira/browse/FLUME-706


On Thu, Aug 11, 2011 at 2:25 AM, Jonathan Hsieh <[email protected]> wrote:

> Satish,
>
> The why question is what was really bugging me, and I was able to get into
> the right state of mind to try to tackle this (note to self, pizza + beer ==
> fuel for jon to deal with concurrency).
>
> I looked at your notes, Eric's patch and logs, found a trace that
> generates the problem, and found what I believe to be the root cause.  I have
> attached a patch to the jira [1] that is the quicker of the suggested
> fixes.
>
> The patch seems to fix the bug by avoiding the root cause of the
> problem.  It seemed reliable when I did some quick manual tests.  Unit
> tests on the full suite are running right now, and I still need to write unit
> tests that would fail prior to the fix.  I'm going to try to post it to the
> new apache reviewboard so you can take a look.
>
> Tomorrow, I'll try to chat with Eric to reconcile the different approaches,
> figure out how to merge the ideas from both, and decide which
> approach to take for the "official fix".  This is sufficiently complicated
> that I'll try to document the functionality in the wiki.
>
> Thanks for your analysis, it really helped a lot!
> Jon.
>
>
>
> On Wed, Aug 10, 2011 at 12:34 PM, Satish Eerpini <[email protected]> wrote:
>
>> Does anybody have ideas/fixes on how to tackle this?
>>
>> What is the best way to stall this? I am thinking of delaying the
>> heartbeat checking mechanism so that it waits for some time before
>> going to the nodesman.get() call. Would that help?
>>
>> I am also curious about why the behavior with the bug is so
>> nondeterministic: sometimes the second start goes through and restarts
>> the ThriftEventSource after forcing the first one to shut down, and
>> sometimes it lands in a Bind error since the first one does not seem
>> to have completely stopped listening on the port.
>>
>> Satish
>>
>> On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
>> > Satish:
>> >
>> > You're 100% dead on. This is a real issue. I fixed this but the code
>> > is so critical I've been afraid to commit it. It needs serious review.
>> > I'm going to do my best to get the fix up (at least on a branch or
>> > something) so folks can pick it apart in the next couple of days. The
>> > JIRA tracking this one is
>> > https://issues.apache.org/jira/browse/FLUME-706
>> >
>> > On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
>> >> Hello folks,
>> >>
>> >>
>> >> There seems to be a convoluted and critical bug in the control flow
>> >> within a logical node. I have been struggling with this issue for the
>> >> past three weeks; here is the synopsis:
>> >>
>> >> Symptoms: using rpcSource starts the thriftEventSource twice,
>> >> resulting in weird SocketExceptions on the agent (because the second
>> >> instance forces the driver to exit and closes the port on which
>> >> thriftEventSource is listening, before bringing it up again)
>> >>
>> >> physical node : "collector"
>> >> logical node : "mycollector" mapped to "collector"
>> >>
>> >> starting a rpcSource on mycollector seems to go through the following logic :
>> >>
>> >> HeartBeatThread ->
>> >> heartbeatChecks() ->
>> >> checkLogicalNodes() ->
>> >> master.getLogicalNodes(physNode) returns "mycollector" ->
>> >> nodesman.get("mycollector") is null in the for loop ->
>> >> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
>> >> nd.loadNodeDriver(src, snk) ->
>> >> startNodeDriver()    ------------------ starts the thriftEventSource ...
>> >>
>> >> While the heartbeat thread is running, the CheckConfigThread is also
>> >> running, which leads to another path where the same driver is started:
>> >>
>> >> the heartbeats which are enqueued by HeartBeatThread ->
>> >> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
>> >>
>> >> are handled as follows:
>> >>
>> >> CheckConfigThread ->
>> >> dequeueCheckConfig()->
>> >> ln.checkConfig(fcd) ->    ------- though checkConfig is synchronized
>> >> it does not really make a difference here.
>> >> loadConfig(data)->
>> >> loadNodeDriver(newSrc, newSnk)->
>> >> startNodeDriver()
>> >>
>> >> As you can see, the above two paths lead to the same driver being opened
>> >> twice, so whichever one reaches driver.start() second
>> >> force-closes the existing driver and opens itself up again.
>> >>
>> >> Another symptom is that when this happens, the heartbeats get backed
>> >> up, which I believe is because the dequeueCheckConfig() call, which follows
>> >> the second path above, has to wait for a timeout in driver.start() when
>> >> it tries to close the existing driver.
>> >>
>> >> Please let me know if I have missed anything. Any help will be greatly
>> >> appreciated.
>> >>
>> >> Thanks
>> >> Satish
>> >> --
>> >> http://satisheerpini.net
>> >>
>> >
>> >
>> >
>> > --
>> > Eric Sammer
>> > twitter: esammer
>> > data: www.cloudera.com
>> >
>>
>>
>>
>> --
>> http://satisheerpini.net
>>
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
>
>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
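
For anyone skimming the thread, here is a minimal sketch of the race
described in the quoted message, where the heartbeat path and the
CheckConfigThread path both reach startNodeDriver() for the same
configuration. It is not the patch attached to FLUME-706 and it does not use
Flume's classes: SketchNode, raceOnce, and the "rpcSource"/"console" strings
are hypothetical stand-ins. The assumption is that making the load step
idempotent (skip the restart when the same source/sink pair is already
loaded) is one way to avoid the double start.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class DriverStartRace {

    /** Hypothetical stand-in for a logical node; NOT Flume's LogicalNode. */
    static class SketchNode {
        private final AtomicInteger starts = new AtomicInteger();
        private String loadedSrc;
        private String loadedSnk;

        /** Starts the driver only if this exact config is not already loaded. */
        synchronized boolean loadNodeDriver(String src, String snk) {
            if (src.equals(loadedSrc) && snk.equals(loadedSnk)) {
                return false; // same config already running: skip the restart
            }
            loadedSrc = src;
            loadedSnk = snk;
            starts.incrementAndGet(); // stands in for startNodeDriver()
            return true;
        }

        int startCount() {
            return starts.get();
        }
    }

    /** Two threads try to load the same config; returns how many starts happened. */
    static int raceOnce() {
        SketchNode node = new SketchNode();
        CountDownLatch go = new CountDownLatch(1);
        Runnable path = () -> {
            try {
                go.await(); // release both threads at the same moment
                node.loadNodeDriver("rpcSource", "console"); // placeholder config
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread heartbeat = new Thread(path, "heartbeat");
        Thread checkConfig = new Thread(path, "check-config");
        heartbeat.start();
        checkConfig.start();
        go.countDown();
        try {
            heartbeat.join();
            checkConfig.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return node.startCount();
    }

    public static void main(String[] args) {
        System.out.println("driver starts: " + raceOnce());
    }
}
```

With the check inside the synchronized method, the second caller sees the
already-loaded configuration and returns without restarting, so the program
prints "driver starts: 1". Without the check, synchronization alone would
still let both callers start the driver one after the other, which matches
the remark in the quoted message that checkConfig being synchronized does not
make a difference.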
