Forgot to include the JIRA link for folks to follow along:
https://issues.apache.org/jira/browse/MESOS-1550


On Fri, Jun 27, 2014 at 8:22 PM, Benjamin Hindman <
[email protected]> wrote:

> If you have written or maintain a Mesos framework please read on.
>
> *What:* Today a long-standing bug was found in the MesosSchedulerDriver
> for Java and Python that causes a framework to get unregistered from Mesos
> without the framework doing so explicitly.
>
> *How:* In the normal lifecycle of a framework the scheduler calls
> 'stop()' on its instance of MesosSchedulerDriver when it's done using the
> driver. IMPORTANT: If the framework plans to fail over it must pass 'true'
> to 'stop()', otherwise 'false' (the default).
>
> Some very old code (from before the introduction of the 'failover' boolean
> argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
> garbage collected was calling 'stop()', which used the default
> semantics of 'false', indicating that the framework would not be failing
> over and reconnecting to Mesos.
>
> *Why:* In particular, why wasn't this bug found before? This behavior
> only occurs when the MesosSchedulerDriver instance actually gets garbage
> collected _AND_ 'stop()' has not already been called. Moreover, in most
> applications that don't call 'stop()' the MesosSchedulerDriver does not get
> garbage collected, either because a reference is maintained for the
> lifetime of the application _OR_ the application is terminated before the
> garbage collector kicks in! Our best guess as to why this was uncovered
> today is that, for whatever reason, the garbage collector kicked in and
> 'stop()' got invoked.
>
> *Short-term Mitigation:*
>
> (1) Never destroy your reference to MesosSchedulerDriver (so the garbage
> collector never cleans it up).
> (2) Always call 'stop(true)' after you're done with the
> MesosSchedulerDriver if you plan on failing over!
>
> In addition, we'll be releasing a *0.19.1* bug fix release which fixes
> this issue.
>
> Apologies for any inconvenience this may cause folks. Big thanks to
> Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.
>
> Ben.
>
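
For anyone who wants to see the mechanism in isolation, below is a minimal
sketch in Python. It does not use the real Mesos bindings: FakeSchedulerDriver
is a hypothetical stand-in whose finalizer calls 'stop()' with the default
'failover=False', which is roughly the behavior described above. The log lists
are also just an illustration device for what the master would be told.

```python
import gc


class FakeSchedulerDriver:
    """Hypothetical stand-in for MesosSchedulerDriver (not the real binding)."""

    def __init__(self, log):
        self.log = log          # records the stop() calls the master would see
        self.stopped = False

    def stop(self, failover=False):
        # Only the first stop() has any effect.
        if self.stopped:
            return
        self.stopped = True
        self.log.append("stop(failover=%s)" % failover)

    def __del__(self):
        # Assumed to mimic the old finalizer: stop() with the default 'false'.
        self.stop()


# Case 1: the reference is dropped without calling stop().
unintended = []
driver = FakeSchedulerDriver(unintended)
del driver
gc.collect()  # refcounting already finalizes in CPython; collect() is a nudge

# Case 2: mitigation (2), call stop(True) before letting go of the driver.
intended = []
driver = FakeSchedulerDriver(intended)
driver.stop(True)
del driver
gc.collect()

# Case 3: mitigation (1), keep a reference so the finalizer never runs.
kept = []
KEEP_ALIVE = FakeSchedulerDriver(kept)
gc.collect()

print(unintended, intended, kept)
```

In case 1 the only recorded call is the finalizer's 'stop()' without failover,
which is the unintended unregistration. In case 2 the explicit 'stop(True)'
wins and the finalizer becomes a no-op. In case 3 nothing is recorded while the
reference is held.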
