Forgot to include the JIRA link for folks to follow along: https://issues.apache.org/jira/browse/MESOS-1550
On Fri, Jun 27, 2014 at 8:22 PM, Benjamin Hindman <[email protected]> wrote:

> If you have written or maintain a Mesos framework please read on.
>
> *What:* Today a long-standing bug was found in the MesosSchedulerDriver
> for Java and Python that causes a framework to get unregistered from Mesos
> without the framework doing so explicitly.
>
> *How:* In the normal lifecycle of a framework the scheduler calls
> 'stop()' on its instance of MesosSchedulerDriver when it's done using the
> driver. IMPORTANT: If the framework plans to fail over it must pass 'true'
> to 'stop()', otherwise 'false' (the default).
>
> Some very old code (from before the introduction of the 'failover' boolean
> argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
> garbage collected was calling 'stop()' with the default semantics of
> 'false', indicating that the framework would not be failing over and
> reconnecting to Mesos.
>
> *Why:* In particular, why wasn't this bug found before? This behavior
> only occurs when the MesosSchedulerDriver instance actually gets garbage
> collected _AND_ 'stop()' has not already been called. Moreover, in most
> applications that don't call 'stop()' the MesosSchedulerDriver does not get
> garbage collected, either because a reference is maintained for the
> lifetime of the application _OR_ because the application is terminated
> before the garbage collector kicks in! Our best guess as to why this was
> uncovered today is that, for whatever reason, the garbage collector kicked
> in and 'stop()' got invoked.
>
> *Short-term Mitigation:*
>
> (1) Never destroy your reference to MesosSchedulerDriver (so the garbage
> collector never cleans it up).
> (2) Always call 'stop(true)' after you're done with the
> MesosSchedulerDriver if you plan on failing over!
>
> In addition, we'll be releasing a *0.19.1* bug fix release which fixes
> this issue.
>
> Apologies for any inconvenience this may cause folks. Big thanks to
> Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.
>
> Ben.
