If you have written or maintain a Mesos framework, please read on.

*What:* Today a long-standing bug was found in the MesosSchedulerDriver for Java and Python that causes a framework to get unregistered from Mesos without the framework doing so explicitly.

*How:* In the normal lifecycle of a framework, the scheduler calls 'stop()' on its instance of MesosSchedulerDriver when it's done using the driver. IMPORTANT: If the framework plans to fail over, it must pass 'true' to 'stop()'; otherwise it passes 'false' (the default). Some very old code (from before the introduction of the 'failover' boolean argument) that gets invoked when a Java or Python MesosSchedulerDriver gets garbage collected was calling 'stop()' with the default semantics of 'false', indicating that the framework would not be failing over and reconnecting to Mesos.

*Why:* In particular, why wasn't this bug found before? This behavior only occurs when the MesosSchedulerDriver instance actually gets garbage collected _AND_ 'stop()' has not already been called. Moreover, in most applications that don't call 'stop()', the MesosSchedulerDriver does not get garbage collected, either because a reference is maintained for the lifetime of the application _OR_ because the application is terminated before the garbage collector kicks in! Our best guess as to why this was uncovered today is that, for whatever reason, the garbage collector kicked in and 'stop()' got invoked.

*Short-term Mitigation:*
(1) Never destroy your reference to MesosSchedulerDriver (so the garbage collector never cleans it up).
(2) Always call 'stop(true)' after you're done with the MesosSchedulerDriver if you plan on failing over!

In addition, we'll be releasing a *0.19.1* bug fix release which fixes this issue.

Apologies for any inconvenience this may cause. Big thanks to Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.

Ben.
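
For readers who want to see the mechanism in isolation, here is a rough Python sketch of the behavior described above. It is only a toy model: 'ToyDriver', its 'log' list, and the two helper functions are hypothetical names made up for illustration and are not part of the Mesos bindings. The only things it borrows from the real driver are a 'stop()' method whose 'failover' argument defaults to false and a finalizer that calls 'stop()' with that default.

```python
import gc


class ToyDriver:
    """Toy stand-in for a scheduler driver. NOT the real Mesos binding."""

    def __init__(self, log):
        self.log = log          # shared list recording what the "master" would see
        self.stopped = False

    def stop(self, failover=False):
        # failover=False models "unregister the framework";
        # failover=True models "disconnect but remain registered".
        if self.stopped:
            return
        self.stopped = True
        self.log.append("kept registered" if failover else "unregistered")

    def __del__(self):
        # Models the old finalizer: it calls stop() with the default argument.
        self.stop()


def drop_without_stop():
    """Lose the last reference without ever calling stop()."""
    log = []
    driver = ToyDriver(log)
    del driver        # last reference gone; the finalizer may now run
    gc.collect()      # nudge runtimes that defer finalization
    return log


def stop_with_failover_then_drop():
    """Call stop(True) explicitly before letting the driver go."""
    log = []
    driver = ToyDriver(log)
    driver.stop(True)
    del driver
    gc.collect()
    return log
```

In this model, dropping the driver without calling 'stop()' lets the finalizer unregister the framework, while calling 'stop(True)' first makes the later finalizer call a no-op. Holding a reference for the lifetime of the application, as in mitigation (1), simply keeps the finalizer from running at all.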
