Here's the plan. Eric, Satish, and I each found different parts of the problem. Satish and I went down similar approaches, and I have a rough draft of a local fix for the immediate problem. Eric and I chatted and we are convinced that this is a reasonable fix for now. We'll move forward on trunk with the local fix and the evolutionary approach, and after that we'll attempt a local refactor while trying to keep mainline backwards compatible.
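To give a feel for the shape of the draft fix, here is a minimal sketch. The names are made up and this is not the actual patch; it just shows the idea of making the per-node driver start idempotent so that whichever path gets there first wins:

// Hypothetical sketch only, not the FLUME-706 patch. Idea: make driver
// startup atomic per logical node, so the heartbeat (spawn) path and
// the check-config path cannot both start a fresh driver.
import java.util.concurrent.atomic.AtomicBoolean;

public class LogicalNodeDriverGuard {
  private final AtomicBoolean driverStarted = new AtomicBoolean(false);

  /** Runs startDriver at most once; returns true if this call did the start. */
  public boolean startNodeDriverOnce(Runnable startDriver) {
    // compareAndSet makes the check-then-act atomic: exactly one caller
    // flips the flag from false to true and actually starts the driver.
    if (driverStarted.compareAndSet(false, true)) {
      startDriver.run();
      return true;
    }
    return false; // the other path won the race; nothing to do here
  }
}

A real fix also has to handle reconfiguration, i.e. stopping the old driver cleanly before a new one rebinds the port, which is where the bind errors come from.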
Eric is experimenting with a larger, more fundamental refactor that could simplify the lifecycle throughout many components of the system. This is being explored in the flume-728 branch and may get merged in the future. For reference, a minimal standalone sketch of the check-then-act race from Satish's trace is appended at the bottom of this thread.

Jon.

On Thu, Aug 11, 2011 at 11:42 PM, Ralph Goers <[email protected]> wrote:

> While it is OK to talk about the problem off list, please post the result
> of your conversation back here. Otherwise the private discussion didn't
> really happen.
>
> Ralph
>
> On Aug 11, 2011, at 2:25 AM, Jonathan Hsieh wrote:
>
> > Satish,
> >
> > The "why" question is what was really bugging me, and I was able to get
> > into the right state of mind to try to tackle this (note to self, pizza
> > + beer == fuel for jon to deal with concurrency).
> >
> > I looked at your notes and at Eric's patch and logs, found a trace that
> > generates the problem, and found what I believe to be the root cause. I
> > have attached a patch to the jira [1] that is the quicker of the
> > suggested fixes.
> >
> > The patch seems to fix the bug by avoiding the root cause of the
> > problem. It seemed reliable in some quick manual tests. Unit tests on
> > the full suite are running right now, and I still need to write unit
> > tests that would fail prior to the fix. I'm going to try to post it to
> > the new Apache reviewboard so you can take a look.
> >
> > Tomorrow, I'll try to chat with Eric to reconcile the different
> > approaches, figure out a plan to merge the ideas from both, and decide
> > which approach to take for the "official fix". This is sufficiently
> > complicated that I'll try to document the functionality in the wiki.
> >
> > Thanks for your analysis, it really helped a lot!
> > Jon.
> >
> > On Wed, Aug 10, 2011 at 12:34 PM, Satish Eerpini <[email protected]> wrote:
> >
> >> Does anybody have ideas/fixes on how to tackle this?
> >>
> >> What is the best way to stall this? I am thinking of delaying the
> >> heartbeat checking mechanism so that it waits for some time before
> >> going to the nodesman.get() call; would that help?
> >>
> >> I am also curious about why the behavior with the bug is so
> >> indeterminate: sometimes the second start goes through and restarts
> >> the ThriftEventSource after forcing the first one to shut down, and
> >> sometimes it lands in a bind error since the first one does not seem
> >> to have completely stopped listening on the port.
> >>
> >> Satish
> >>
> >> On Tue, Aug 9, 2011 at 5:17 PM, Eric Sammer <[email protected]> wrote:
> >>
> >>> Satish:
> >>>
> >>> You're 100% dead on. This is a real issue. I fixed this but the code
> >>> is so critical I've been afraid to commit it. It needs serious
> >>> review. I'm going to do my best to get the fix up (at least on a
> >>> branch or something) so folks can pick it apart in the next couple
> >>> of days.
> >>> The JIRA tracking this one is
> >>> https://issues.apache.org/jira/browse/FLUME-706
> >>>
> >>> On Tue, Aug 9, 2011 at 5:07 PM, Satish Eerpini <[email protected]> wrote:
> >>>
> >>>> Hello folks,
> >>>>
> >>>> There seems to be a convoluted and critical bug in the control flow
> >>>> within a logical node. I have been struggling with this issue for
> >>>> the past three weeks; here is the synopsis:
> >>>>
> >>>> Symptoms: using rpcSource starts the thriftEventSource twice,
> >>>> resulting in weird SocketExceptions on the agent (because the second
> >>>> instance forces the driver to exit and closes the port on which the
> >>>> thriftEventSource is listening, before bringing it up again).
> >>>>
> >>>> physical node: "collector"
> >>>> logical node: "mycollector" mapped to "collector"
> >>>>
> >>>> Starting an rpcSource on mycollector seems to go through the
> >>>> following logic:
> >>>>
> >>>> HeartBeatThread ->
> >>>>   heartbeatChecks() ->
> >>>>     checkLogicalNodes() ->
> >>>>       master.getLogicalNodes(physNode) returns "mycollector" ->
> >>>>       nodesman.get("mycollector") is null in the for loop ->
> >>>>       nodesman.spawn(ln, data.getSourceConfig(), data.getSinkConfig()) ->
> >>>>         nd.loadNodeDriver(src, snk) ->
> >>>>           startNodeDriver()    <-- starts the thriftEventSource ...
> >>>>
> >>>> While the heartbeat thread is running, the CheckConfigThread is also
> >>>> running, which leads to another path where the same driver is
> >>>> started. The heartbeats which are enqueued by HeartBeatThread ->
> >>>> heartBeatChecks() -> checkLogicalNodeConfigs() -> enqueueCheckConfig()
> >>>> are handled as follows:
> >>>>
> >>>> CheckConfigThread ->
> >>>>   dequeueCheckConfig() ->
> >>>>     ln.checkConfig(fcd) ->    <-- though checkConfig is synchronized,
> >>>>                                   it does not really make a difference here
> >>>>       loadConfig(data) ->
> >>>>         loadNodeDriver(newSrc, newSnk) ->
> >>>>           startNodeDriver()
> >>>>
> >>>> As you can see, the above two paths lead to the same driver being
> >>>> opened twice, so the second one that reaches driver.start()
> >>>> force-closes the existing driver and opens itself up again.
> >>>>
> >>>> Another symptom is that when this happens the heartbeats get backed
> >>>> up, which I believe is because the dequeueCheckConfig() that follows
> >>>> the second path above has to wait for a timeout in driver.start()
> >>>> when it tries to close the existing driver.
> >>>>
> >>>> Please let me know if I have missed anything; any help will be
> >>>> greatly appreciated.
> >>>>
> >>>> Thanks
> >>>> Satish
> >>>> --
> >>>> http://satisheerpini.net
> >>>
> >>> --
> >>> Eric Sammer
> >>> twitter: esammer
> >>> data: www.cloudera.com
> >>
> >> --
> >> http://satisheerpini.net
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // [email protected]

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
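Appendix (the sketch mentioned at the top): a minimal standalone illustration of the check-then-act race in Satish's trace, with made-up names rather than the real Flume classes. Both the "heartbeat" and "check-config" threads can pass the null check before either assigns the field, so two drivers get started and the second forces the first off its port.

// Hypothetical illustration only, not Flume code.
public class DoubleStartRace {
  private volatile Thread driver; // stands in for the node driver

  // Called from both paths; the check and the act are not atomic.
  void startNodeDriver() {
    if (driver == null) {            // check...
      driver = new Thread(() -> {
        try {
          Thread.sleep(1000);        // pretend to serve traffic for a while
        } catch (InterruptedException ignored) {
        }
      });
      driver.start();                // ...then act
    }
  }

  public static void main(String[] args) {
    DoubleStartRace node = new DoubleStartRace();
    new Thread(node::startNodeDriver).start(); // HeartBeatThread path
    new Thread(node::startNodeDriver).start(); // CheckConfigThread path
    // With unlucky timing both callers pass the null check and two
    // "drivers" run at once; an atomic guard like the sketch near the
    // top of this thread closes the window.
  }
}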
