Re: Critical Bug in Flume LogicalNode core logic

Satish Eerpini Wed, 10 Aug 2011 12:34:58 -0700

Does anybody have ideas or fixes for tackling this?

What is the best way to stall this? I am thinking of delaying the
heartbeat checking mechanism so that it waits for some time before
going to the nodesman.get() call. Would that help?
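
One alternative to a fixed delay, sketched here purely as an illustration:
make the start idempotent, so that whichever thread arrives second finds the
driver already registered and does not start it again. The class and method
names below (SingleStartSketch, Driver, startOnce) are hypothetical and are
not Flume's API; the actual fix tracked in FLUME-706 may look quite different.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleStartSketch {
    // Hypothetical stand-in for a node driver; it only counts its starts.
    static final class Driver {
        final AtomicInteger starts = new AtomicInteger();
        void start() { starts.incrementAndGet(); }
    }

    // One driver per logical node name (assumed keying, not Flume's own).
    private final ConcurrentHashMap<String, Driver> drivers = new ConcurrentHashMap<>();

    // computeIfAbsent runs the mapping function at most once per key,
    // so a second caller gets the existing driver instead of restarting it.
    Driver startOnce(String logicalNode) {
        return drivers.computeIfAbsent(logicalNode, name -> {
            Driver d = new Driver();
            d.start();
            return d;
        });
    }

    // Two threads, named after the two paths in this thread, race to start
    // the same logical node; returns how many times the driver was started.
    public static int demo() {
        SingleStartSketch nodes = new SingleStartSketch();
        Runnable spawn = () -> nodes.startOnce("mycollector");
        Thread heartbeat = new Thread(spawn, "heartbeat");
        Thread checkConfig = new Thread(spawn, "check-config");
        heartbeat.start();
        checkConfig.start();
        try {
            heartbeat.join();
            checkConfig.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return nodes.startOnce("mycollector").starts.get();
    }

    public static void main(String[] args) {
        System.out.println("driver starts: " + demo());
    }
}
```

A real driver start can block for a while, and computeIfAbsent holds an
internal lock for that key while the function runs, so a production version
would probably want its own per-node lock or state flag instead.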


I am also curious about why the behavior with this bug is so
nondeterministic: sometimes the second start goes through and restarts
the ThriftEventSource after forcing the first one to shut down, and
sometimes it ends in a bind error because the first one does not seem
to have completely closed its listener on the port.
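
One plausible explanation is a plain timing race: the outcome depends on
whether the first listener has released the port by the time the second
start binds. The sketch below shows both outcomes using java.net.ServerSocket
directly rather than Thrift's server classes, which is an assumption; the
real ThriftEventSource may behave differently. BindRaceSketch and its methods
are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BindRaceSketch {
    // Case 1: the first listener is still open when the second one binds.
    // Returns true if the second bind is rejected with a BindException.
    public static boolean secondBindFailsWhileFirstOpen() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
            int port = first.getLocalPort(); // port 0 = let the OS pick a free one
            try (ServerSocket second = new ServerSocket(port, 50, loopback)) {
                return false; // bound twice; not expected on common platforms
            } catch (BindException e) {
                return true;  // typically "Address already in use"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Case 2: the first listener is fully closed before the second one binds.
    // Returns true if the second bind on the same port succeeds.
    public static boolean rebindSucceedsAfterClose() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try {
            int port;
            try (ServerSocket first = new ServerSocket(0, 50, loopback)) {
                port = first.getLocalPort();
            } // first is closed here
            try (ServerSocket second = new ServerSocket(port, 50, loopback)) {
                return second.isBound();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bind fails while first open: " + secondBindFailsWhileFirstOpen());
        System.out.println("rebind works after close:    " + rebindSucceedsAfterClose());
    }
}
```

If the forced shutdown of the first driver returns (or times out) before its
socket is actually closed, the second start would land in case 1 on some runs
and case 2 on others.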

Satish

On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> Satish:
>
> You're 100% dead on. This is a real issue. I fixed this but the code
> is so critical I've been afraid to commit it. It needs serious review.
> I'm going to do my best to get the fix up (at least on a branch or
> something) so folks can pick it apart in the next couple of days. The
> JIRA tracking this one is
> https://issues.apache.org/jira/browse/FLUME-706
>
> On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
>> Hello folks,
>>
>>
>> There seems to be a convoluted and critical bug in the control flow
>> within a logical node. I have been struggling with this issue for the
>> past three weeks. Here is the synopsis:
>>
>> Symptoms: using rpcSource starts the thriftEventSource twice,
>> resulting in SocketExceptions on the agent (because the second
>> instance forces the driver to exit and closes the port on which
>> thriftEventSource is listening, before bringing it up again).
>>
>> physical node : "collector"
>> logical node : "mycollector" mapped to "collector"
>>
>> Starting an rpcSource on mycollector seems to go through the following logic:
>>
>> HeartBeatThread ->
>> heartbeatChecks() ->
>> checkLogicalNodes() ->
>> master.getLogicalNodes(physNode) returns "mycollector" ->
>> nodesman.get("mycollector") is null in the for loop ->
>> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
>> nd.loadNodeDriver(src, snk) ->
>> startNodeDriver()    ------------------ starts the thriftEventSource ...
>>
>> While the heartbeat thread is running, the CheckConfigThread is also
>> running, which leads to another path where the same driver is started.
>>
>> The heartbeats which are enqueued by HeartBeatThread ->
>> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
>>
>> are handled as follows:
>>
>> CheckConfigThread ->
>> dequeueCheckConfig()->
>> ln.checkConfig(fcd) ->    ------- though checkConfig is synchronized
>> it does not really make a difference here.
>> loadConfig(data)->
>> loadNodeDriver(newSrc, newSnk)->
>> startNodeDriver()
>>
>> As you can see, the two paths above lead to the same driver being opened
>> twice, so whichever one reaches driver.start() second force-closes the
>> existing driver and opens it again.
>>
>> Another symptom is that when this happens the heartbeats get backed
>> up, which I believe is because dequeueCheckConfig(), which follows
>> the second path above, has to wait for the timeout in driver.start() when
>> it tries to close the existing driver.
>>
>> Please let me know if I have missed anything, any help will be greatly
>> appreciated.
>>
>> Thanks
>> Satish
>> --
>> http://satisheerpini.net
>>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>



-- 
http://satisheerpini.net
