Re: Critical Bug in Flume LogicalNode core logic

Satish Eerpini Tue, 09 Aug 2011 18:40:36 -0700

Eric,

Thanks for the quick response. Can you share the patch so that we can test it?



Satish

On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> Satish:
>
> You're 100% dead on. This is a real issue. I fixed this but the code
> is so critical I've been afraid to commit it. It needs serious review.
> I'm going to do my best to get the fix up (at least on a branch or
> something) so folks can pick it apart in the next couple of days. The
> JIRA tracking this one is
> https://issues.apache.org/jira/browse/FLUME-706
>
> On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
>> Hello folks,
>>
>>
>> There seems to be a convoluted and critical bug in the control flow
>> within a logical node. I have been struggling with this issue for the
>> past three weeks; here is the synopsis:
>>
>> Symptoms: using rpcSource starts the thriftEventSource twice,
>> resulting in weird SocketExceptions on the agent (because the second
>> instance forces the driver to exit and closes the port on which
>> thriftEventSource is listening, before bringing it up again).
>>
>> physical node : "collector"
>> logical node : "mycollector" mapped to "collector"
>>
>> Starting an rpcSource on mycollector seems to go through the following logic:
>>
>> HeartBeatThread ->
>> heartbeatChecks() ->
>> checkLogicalNodes() ->
>> master.getLogicalNodes(physNode) returns "mycollector" ->
>> nodesman.get("mycollector") is null in the for loop ->
>> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
>> nd.loadNodeDriver(src, snk) ->
>> startNodeDriver()    ------------------ starts the thriftEventSource ...
>>
>> While the heartbeat thread is running, the CheckConfigThread is also
>> running, which leads to another path where the same driver is started:
>>
>> The heartbeats, which are enqueued by HeartBeatThread ->
>> heartbeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig(),
>>
>> are handled as follows:
>>
>> CheckConfigThread ->
>> dequeueCheckConfig() ->
>> ln.checkConfig(fcd) ->    ------- though checkConfig is synchronized,
>>                                   it does not really make a difference here
>> loadConfig(data) ->
>> loadNodeDriver(newSrc, newSnk) ->
>> startNodeDriver()
>>
>> As you can see, the above two paths lead to the same driver being opened
>> twice, so the second one to reach driver.start() force-closes the
>> existing driver and opens itself up again.
>>
>> Another symptom is that when this happens, the heartbeats get backed
>> up, which I believe is because dequeueCheckConfig(), which follows
>> the second path above, has to wait for a timeout in driver.start() when
>> it tries to close the existing driver.
>>
>> Please let me know if I have missed anything. Any help will be greatly
>> appreciated.
>>
>> Thanks
>> Satish
>> --
>> http://satisheerpini.net
>>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>



-- 
http://satisheerpini.net
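
For anyone following the thread, the double start described in the quoted
report can be reproduced in miniature. The sketch below is not Flume code and
is not the fix tracked in FLUME-706: Driver, Node, and the skipUnchanged guard
are hypothetical stand-ins, and "console" is only a placeholder sink string.
It assumes just what the report states, namely that two threads each reach
loadNodeDriver() with the same source and sink, and that a second start
force-closes the driver that is already running.

```java
import java.util.Objects;

public class DoubleStartSketch {

    /** Hypothetical stand-in for the node driver: counts starts and forced closes. */
    static final class Driver {
        int starts;
        int forcedCloses;
        boolean running;

        void start() {
            if (running) {
                // As in the report: a second start first force-closes the running driver.
                forcedCloses++;
            }
            running = true;
            starts++;
        }
    }

    /** Hypothetical stand-in for a logical node; not the real Flume class. */
    static final class Node {
        private final Driver driver = new Driver();
        private final boolean skipUnchanged;
        private String curSrc;
        private String curSnk;

        Node(boolean skipUnchanged) {
            this.skipUnchanged = skipUnchanged;
        }

        // synchronized only orders the two callers; on its own it does not
        // stop the second caller from restarting the driver.
        synchronized void loadNodeDriver(String src, String snk) {
            boolean unchanged = driver.running
                    && Objects.equals(src, curSrc)
                    && Objects.equals(snk, curSnk);
            if (skipUnchanged && unchanged) {
                return; // assumed guard: same config already running, nothing to do
            }
            curSrc = src;
            curSnk = snk;
            driver.start();
        }
    }

    /** Runs both paths once and returns {starts, forcedCloses}. */
    static int[] run(boolean skipUnchanged) {
        Node node = new Node(skipUnchanged);
        // Path 1: heartbeat thread spawning the logical node.
        Thread heartbeat = new Thread(() -> node.loadNodeDriver("rpcSource", "console"));
        // Path 2: check-config thread loading the same configuration.
        Thread checkConfig = new Thread(() -> node.loadNodeDriver("rpcSource", "console"));
        heartbeat.start();
        checkConfig.start();
        try {
            heartbeat.join();
            checkConfig.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {node.driver.starts, node.driver.forcedCloses};
    }

    public static void main(String[] args) {
        int[] unguarded = run(false);
        int[] guarded = run(true);
        System.out.println("unguarded: starts=" + unguarded[0] + " forcedCloses=" + unguarded[1]);
        System.out.println("guarded:   starts=" + guarded[0] + " forcedCloses=" + guarded[1]);
    }
}
```

Without the guard the driver is started twice with one forced close, whichever
thread wins the lock, which matches the observation that synchronizing
checkConfig makes no difference. With the guard the second caller sees an
unchanged configuration and returns. Whether comparing configurations is the
right approach for the real code is a separate question for the review.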
