Eric, thanks for the quick response. Can you share the patch so that we can test it?
Satish

On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> Satish:
>
> You're 100% dead on. This is a real issue. I fixed this but the code
> is so critical I've been afraid to commit it. It needs serious review.
> I'm going to do my best to get the fix up (at least on a branch or
> something) so folks can pick it apart in the next couple of days. The
> JIRA tracking this one is
> https://issues.apache.org/jira/browse/FLUME-706
>
> On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
>> Hello folks,
>>
>>
>> There seems to be a convoluted and critical bug in the control flow
>> within a logical node. I have been struggling with this issue for the
>> past three weeks; here is the synopsis:
>>
>> Symptoms: using rpcSource starts the thriftEventSource twice,
>> resulting in weird SocketExceptions on the agent (because the second
>> instance forces the driver to exit and closes the port on which
>> thriftEventSource is listening, before bringing it up again)
>>
>> physical node : "collector"
>> logical node : "mycollector" mapped to "collector"
>>
>> starting an rpcSource on mycollector seems to go through the following logic:
>>
>> HeartBeatThread ->
>> heartbeatChecks() ->
>> checkLogicalNodes() ->
>> master.getLogicalNodes(physNode) returns "mycollector" ->
>> nodesman.get("mycollector") is null in the for loop ->
>> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
>> nd.loadNodeDriver(src, snk) ->
>> startNodeDriver() ------------------ starts the thriftEventSource ...
>>
>> While the heartbeat thread is running, the CheckConfigThread is also
>> running, which leads to another path where the same driver is started:
>>
>> the heartbeats which are enqueued by HeartBeatThread ->
>> heartbeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
>>
>> are handled as follows:
>>
>> CheckConfigThread ->
>> dequeueCheckConfig() ->
>> ln.checkConfig(fcd) -> ------- though checkConfig is synchronized
>> it does not really make a difference here.
>> loadConfig(data) ->
>> loadNodeDriver(newSrc, newSnk) ->
>> startNodeDriver()
>>
>> As you can see, the above two paths lead to the same driver being opened
>> twice, so the second one to reach driver.start() force-closes the
>> existing driver and opens itself up again.
>>
>> Another symptom is that when this happens, the heartbeats get backed
>> up, which I believe is because the dequeueCheckConfig() that follows
>> the second path above has to wait for a timeout in driver.start() when
>> it tries to close the existing driver.
>>
>> Please let me know if I have missed anything; any help will be greatly
>> appreciated.
>>
>> Thanks
>> Satish
>> --
>> http://satisheerpini.net
>>
>
>
>
> --
> Eric Sammer
> twitter: esammer
> data: www.cloudera.com
>

--
http://satisheerpini.net
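
P.S. For anyone reviewing, here is a minimal sketch of the double start described in the quoted report, plus one possible guard against it. This is not the FLUME-706 patch and not Flume's actual code: SketchNode, race(), and the "rpcSource"/"someSink" strings are hypothetical stand-ins, and the guard (skip the restart when the identical source/sink pair is already running) is only an assumption about what a fix might look like. Both threads below play the roles of the heartbeat spawn path and the CheckConfigThread path.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class DoubleStartSketch {
    /** Hypothetical stand-in for a logical node; not Flume's real class. */
    static class SketchNode {
        private final AtomicInteger starts = new AtomicInteger();
        private String runningConfig; // guarded by "this"

        /** Starts the driver only if this exact config is not already running. */
        synchronized boolean loadNodeDriver(String src, String snk) {
            String cfg = src + " | " + snk;
            if (cfg.equals(runningConfig)) {
                return false; // same config already live: do not restart
            }
            runningConfig = cfg;
            starts.incrementAndGet(); // stands in for startNodeDriver()
            return true;
        }

        int startCount() {
            return starts.get();
        }
    }

    /** Runs both paths concurrently; returns how many driver starts happened. */
    static int race() {
        SketchNode node = new SketchNode();
        CountDownLatch go = new CountDownLatch(1);
        Runnable path = () -> {
            try {
                go.await(); // release both threads at the same moment
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            node.loadNodeDriver("rpcSource", "someSink");
        };
        Thread heartbeat = new Thread(path, "heartbeat-spawn");
        Thread checkConfig = new Thread(path, "check-config");
        heartbeat.start();
        checkConfig.start();
        go.countDown();
        try {
            heartbeat.join();
            checkConfig.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return node.startCount();
    }

    public static void main(String[] args) {
        System.out.println("driver starts: " + race());
    }
}
```

With the equality check removed from loadNodeDriver, both threads would count a start, which mirrors the second path force-closing and reopening the driver. Whether comparing the config strings is the right test in the real code is exactly the kind of thing that needs review.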
