Hello folks,
There seems to be a convoluted and critical bug in the control flow
within a logical node. I have been struggling with this issue for the
past three weeks; here is the synopsis:
Symptoms: using rpcSource starts the thriftEventSource twice,
resulting in weird SocketExceptions on the agent (because the second
instance forces the driver to exit and closes the port on which the
thriftEventSource is listening, before bringing it up again).
Physical node: "collector"
Logical node: "mycollector" mapped to "collector"
Starting an rpcSource on mycollector seems to go through the following logic:
HeartBeatThread ->
heartbeatChecks() ->
checkLogicalNodes() ->
master.getLogicalNodes(physNode) returns "mycollector" ->
nodesman.get("mycollector") is null in the for loop ->
nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
nd.loadNodeDriver(src, snk) ->
startNodeDriver() ------------------ starts the thriftEventSource ...
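
For reference, here is a tiny runnable paraphrase of that decision as
I read it (the map stands in for the real node manager; the names come
from the trace above, not the actual source):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatSpawnSketch {
    // Stands in for the node manager's registry of running logical nodes.
    static final Map<String, Object> nodesman = new HashMap<>();

    static void checkLogicalNodes(List<String> logicalNodesFromMaster) {
        for (String ln : logicalNodesFromMaster) {
            // "mycollector" is not registered locally yet, so the
            // heartbeat thread itself spawns it, which ends in
            // startNodeDriver() and brings up the thriftEventSource.
            if (nodesman.get(ln) == null) {
                System.out.println("spawn(" + ln + ") -> loadNodeDriver -> startNodeDriver");
                nodesman.put(ln, new Object());
            }
        }
    }

    public static void main(String[] args) {
        checkLogicalNodes(List.of("mycollector"));
    }
}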
While the heartbeat thread is running, the CheckConfigThread is also
running, which leads to another path where the same driver is started:
The heartbeats that are enqueued by HeartBeatThread ->
heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
are handled as follows:
CheckConfigThread ->
dequeueCheckConfig()->
ln.checkConfig(fcd) -> ------- though checkConfig is synchronized,
it does not really make a difference here.
loadConfig(data)->
loadNodeDriver(newSrc, newSnk)->
startNodeDriver()
As you can see, the above two paths lead to the same driver being
opened twice; the second one to reach driver.start() force-closes the
existing driver and opens itself up again.
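
To sanity-check my reading, I put together a self-contained toy of the
collision (all names here are mine, not Flume's, and the socket
handling is simplified): the second thread to arrive closes the socket
the first one bound, which matches the SocketExceptions I am seeing.

import java.net.ServerSocket;

public class DoubleStartToy {
    private ServerSocket thriftPort; // stands in for thriftEventSource's socket

    // Mimics the observed behaviour: a second start() force-closes the
    // existing driver's socket before rebinding.
    synchronized void startNodeDriver() throws Exception {
        if (thriftPort != null) {
            System.out.println(Thread.currentThread().getName()
                    + ": closing existing driver's port");
            thriftPort.close(); // anyone blocked on accept() would see a SocketException
        }
        thriftPort = new ServerSocket(0); // arbitrary free port for the demo
        System.out.println(Thread.currentThread().getName()
                + ": bound port " + thriftPort.getLocalPort());
    }

    public static void main(String[] args) {
        DoubleStartToy node = new DoubleStartToy();
        Runnable toDriver = () -> {
            try {
                node.startNodeDriver();
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        // Path 1: HeartBeatThread -> spawn -> loadNodeDriver -> startNodeDriver
        new Thread(toDriver, "HeartBeatThread").start();
        // Path 2: CheckConfigThread -> checkConfig -> loadConfig -> startNodeDriver
        new Thread(toDriver, "CheckConfigThread").start();
    }
}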
Another symptom is that when this happens, the heartbeats get backed
up. I believe this is because dequeueCheckConfig(), which follows the
second path above, has to wait for a timeout in driver.start() when it
tries to close the existing driver.
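
For what it's worth, the kind of idempotence I would have expected
somewhere on the shared loadNodeDriver path looks roughly like this
(hypothetical names, completely untested; only meant to make concrete
where the two paths would need to agree):

public class GuardedNodeSketch {
    private Object driver;        // stands in for the NodeDriver
    private Object currentConfig; // config the running driver was started with

    synchronized void loadNodeDriver(Object newConfig) {
        // If a driver is already running with this exact config, the
        // second caller becomes a no-op instead of a forced restart.
        if (driver != null && newConfig.equals(currentConfig)) {
            return;
        }
        // Only a real config change should stop the old driver and rebind.
        driver = new Object();
        currentConfig = newConfig;
        System.out.println("startNodeDriver() with " + newConfig);
    }

    public static void main(String[] args) {
        GuardedNodeSketch node = new GuardedNodeSketch();
        node.loadNodeDriver("cfg-v1"); // HeartBeatThread path: starts driver
        node.loadNodeDriver("cfg-v1"); // CheckConfigThread path: no restart
    }
}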
Please let me know if I have missed anything; any help will be greatly
appreciated.
Thanks
Satish
--
http://satisheerpini.net