Satish:

You're 100% dead on. This is a real issue. I fixed this but the code
is so critical I've been afraid to commit it. It needs serious review.
I'm going to do my best to get the fix up (at least on a branch or
something) so folks can pick it apart in the next couple of days. The
JIRA tracking this one is
https://issues.apache.org/jira/browse/FLUME-706

On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
> Hello folks,
>
>
> There seems to be a convoluted and critical bug in the control flow
> within a logical node. I have been struggling with this issue for the
> past three weeks. Here is the synopsis:
>
> Symptoms: using rpcSource starts the thriftEventSource twice,
> resulting in weird SocketExceptions on the agent (because the second
> instance forces the driver to exit and closes the port on which
> thriftEventSource is listening, before bringing it up again).
>
> physical node : "collector"
> logical node : "mycollector" mapped to "collector"
>
> Starting an rpcSource on mycollector seems to go through the following logic:
>
> HeartBeatThread ->
> heartbeatChecks() ->
> checkLogicalNodes() ->
> master.getLogicalNodes(physNode) returns "mycollector" ->
> nodesman.get("mycollector") is null in the for loop ->
> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
> nd.loadNodeDriver(src, snk) ->
> startNodeDriver()    ------------------ starts the thriftEventSource ...
>
> While the heartbeat thread is running, the CheckConfigThread is also
> running, which leads to another path where the same driver is started.
>
> The heartbeats that are enqueued by HeartBeatThread ->
> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
>
> are handled as follows:
>
> CheckConfigThread ->
> dequeueCheckConfig()->
> ln.checkConfig(fcd) ->    ------- (checkConfig is synchronized, but
>                                   that makes no real difference here)
> loadConfig(data)->
> loadNodeDriver(newSrc, newSnk)->
> startNodeDriver()
>
> As you can see, the two paths above lead to the same driver being
> opened twice: whichever path reaches driver.start() second force-closes
> the existing driver and then opens it again.
>
> Another symptom is that when this happens, the heartbeats get backed
> up. I believe this is because dequeueCheckConfig(), which follows the
> second path above, has to wait for the timeout in driver.start() when
> it tries to close the existing driver.
>
> Please let me know if I have missed anything, any help will be greatly
> appreciated.
>
> Thanks
> Satish
> --
> http://satisheerpini.net
>
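
For anyone following the thread, here is a minimal sketch of the race
described above and one possible guard. It is not the FLUME-706 fix and
not Flume's actual code: the class LogicalNodeSketch, its fields, and
the placeholder config strings are all hypothetical. It assumes both
paths end up in a shared loadNodeDriver() call, and that skipping a
restart when the requested source/sink config matches the running one
is acceptable.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a logical node; NOT Flume's LogicalNode. */
public class LogicalNodeSketch {
    // Counts how often the (simulated) driver was actually started.
    private final AtomicInteger driverStarts = new AtomicInteger();
    private String activeSrc;
    private String activeSnk;

    /**
     * Starts the driver only if the requested config differs from the
     * one already running. Returns true if a (re)start happened.
     */
    public synchronized boolean loadNodeDriver(String src, String snk) {
        if (Objects.equals(src, activeSrc) && Objects.equals(snk, activeSnk)) {
            return false; // same config already running: skip the restart
        }
        activeSrc = src;
        activeSnk = snk;
        driverStarts.incrementAndGet(); // stands in for startNodeDriver()
        return true;
    }

    public int driverStarts() {
        return driverStarts.get();
    }

    public static void main(String[] args) throws InterruptedException {
        LogicalNodeSketch node = new LogicalNodeSketch();
        // Both threads request the same placeholder config, mirroring
        // the spawn path and the checkConfig path from the report.
        Runnable load = () -> node.loadNodeDriver("src-config", "snk-config");
        Thread heartbeat = new Thread(load, "HeartBeatThread");
        Thread checkConfig = new Thread(load, "CheckConfigThread");
        heartbeat.start();
        checkConfig.start();
        heartbeat.join();
        checkConfig.join();
        System.out.println("driver starts: " + node.driverStarts());
    }
}
```

Synchronizing alone only serializes the two calls; it is the config
comparison inside the lock that keeps the second caller from closing
and reopening a driver that is already running with the same config.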



-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com
