Forgot link: [1] https://issues.apache.org/jira/browse/FLUME-706
On Thu, Aug 11, 2011 at 2:25 AM, Jonathan Hsieh <[email protected]> wrote:

> Satish,
>
> The "why" question is what was really bugging me, and I was able to get into
> the right state of mind to try to tackle this (note to self: pizza + beer ==
> fuel for jon to deal with concurrency).
>
> I looked at your notes, Eric's patch and logs, found a trace that
> generates the problem, and found what I believe to be the root cause. I have
> attached a patch to the jira [1] that is the quicker of the suggested
> fixes.
>
> The patch seems to fix the bug by avoiding the root cause of the
> problem. It seemed reliable when I did some quick manual tests. Unit
> tests on the full suite are running right now, and I still need to write unit
> tests that would fail prior to the fix. I'm going to try to post it to the
> new apache reviewboard so you can take a look.
>
> Tomorrow, I'll try to chat with Eric to reconcile the different approaches,
> figure out a plan to merge the ideas from both, and decide which
> approach to take for the "official fix". This is sufficiently complicated
> that I'll try to document the functionality in the wiki.
>
> Thanks for your analysis, it really helped a lot!
> Jon.
>
> On Wed, Aug 10, 2011 at 12:34 PM, Satish Eerpini <[email protected]> wrote:
>
>> Does anybody have ideas/fixes on how to tackle this?
>>
>> What is the best way to stall this? I am thinking of delaying the
>> heartbeat checking mechanism so that it waits for some time before
>> going to the nodesman.get() call. Would that help?
>>
>> I am also curious about why the behavior with the bug is so
>> indeterminate: sometimes the second start goes through and restarts
>> the ThriftEventSource after forcing the first one to shut down, and
>> "sometimes" it lands in a Bind error since the first one does not seem
>> to have completely stopped listening on the port.
>>
>> Satish
>>
>> On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
>> > Satish:
>> >
>> > You're 100% dead on. This is a real issue. I fixed this but the code
>> > is so critical I've been afraid to commit it. It needs serious review.
>> > I'm going to do my best to get the fix up (at least on a branch or
>> > something) so folks can pick it apart in the next couple of days. The
>> > JIRA tracking this one is
>> > https://issues.apache.org/jira/browse/FLUME-706
>> >
>> > On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
>> >> Hello folks,
>> >>
>> >> There seems to be a convoluted and critical bug in the control flow
>> >> within a logical node. I have been struggling with this issue for the
>> >> past three weeks; here is the synopsis:
>> >>
>> >> Symptoms: using rpcSource starts the thriftEventSource twice,
>> >> resulting in weird SocketExceptions on the agent (because the second
>> >> instance forces the driver to exit and closes the port on which
>> >> thriftEventSource is listening, before bringing it up again)
>> >>
>> >> physical node : "collector"
>> >> logical node : "mycollector" mapped to "collector"
>> >>
>> >> Starting a rpcSource on mycollector seems to go through the following logic:
>> >>
>> >> HeartBeatThread ->
>> >> heartbeatChecks() ->
>> >> checkLogicalNodes() ->
>> >> master.getLogicalNodes(physNode) returns "mycollector" ->
>> >> nodesman.get("mycollector") is null in the for loop ->
>> >> nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
>> >> nd.loadNodeDriver(src, snk) ->
>> >> startNodeDriver() ------------------ starts the thriftEventSource ...
>> >>
>> >> While the heartbeat thread is running, the CheckConfigThread is also
>> >> running, which leads to another path where the same driver is started:
>> >>
>> >> the heartbeats which are enqueued by HeartBeatThread ->
>> >> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
>> >>
>> >> are handled as follows:
>> >>
>> >> CheckConfigThread ->
>> >> dequeueCheckConfig() ->
>> >> ln.checkConfig(fcd) -> ------- though checkConfig is synchronized
>> >> it does not really make a difference here.
>> >> loadConfig(data) ->
>> >> loadNodeDriver(newSrc, newSnk) ->
>> >> startNodeDriver()
>> >>
>> >> As you see, the above two paths lead to the same driver being opened
>> >> twice, so the second one to reach driver.start() force-closes the
>> >> existing driver and opens itself up again.
>> >>
>> >> Another symptom is that when this happens, the heartbeats get backed
>> >> up, which I believe is because the dequeueCheckConfig() call, which follows
>> >> the second path above, has to wait for a timeout in driver.start() when
>> >> it tries to close the existing driver.
>> >>
>> >> Please let me know if I have missed anything; any help will be greatly
>> >> appreciated.
>> >>
>> >> Thanks
>> >> Satish
>> >> --
>> >> http://satisheerpini.net
>> >>
>> >
>> > --
>> > Eric Sammer
>> > twitter: esammer
>> > data: www.cloudera.com
>> >
>>
>> --
>> http://satisheerpini.net
>>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
>

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
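
For readers following the thread: the double start described above (a heartbeat path and a check-config path both reaching startNodeDriver() for the same logical node) is a check-then-act race. Below is a minimal Java sketch of one common way to make such a start idempotent. It is not the patch attached to FLUME-706 and not Eric's fix; the class and method names (DriverStartRace, LogicalNodeSketch, startDriverIfNeeded, raceTwoCallers) are hypothetical and exist only for illustration. It assumes both paths can be funneled through a single guarded method on the same object.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DriverStartRace {

    /** Hypothetical stand-in for a logical node that owns exactly one driver. */
    static class LogicalNodeSketch {
        private final AtomicInteger starts = new AtomicInteger();
        private boolean running = false; // guarded by "this"

        /**
         * Starts the driver only if it is not already running.
         * The check and the state change happen under one lock, so a second
         * caller sees running == true and does nothing.
         */
        synchronized boolean startDriverIfNeeded() {
            if (running) {
                return false;
            }
            running = true;
            starts.incrementAndGet(); // stands in for opening the listening port
            return true;
        }

        int startCount() {
            return starts.get();
        }
    }

    /** Races two callers against one node and returns the number of real starts. */
    static int raceTwoCallers() {
        LogicalNodeSketch node = new LogicalNodeSketch();
        CountDownLatch go = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Runnable caller = () -> {
            try {
                go.await(); // release both callers at the same moment
                node.startDriverIfNeeded();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        pool.submit(caller); // plays the role of the heartbeat path
        pool.submit(caller); // plays the role of the check-config path
        go.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("callers did not finish");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return node.startCount();
    }

    public static void main(String[] args) {
        System.out.println("driver starts: " + raceTwoCallers());
    }
}
```

The point of the sketch is that the "is it running?" check and the start itself sit behind the same lock on the same object. A synchronized checkConfig() alone would not give that guarantee if another path reaches the driver without taking that lock, which may be why it "does not really make a difference" in the trace above.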
