If you have written or maintain a Mesos framework please read on.

*What:* Today a long-standing bug was found with the MesosSchedulerDriver
for Java and Python that causes a framework to get unregistered with Mesos
without the framework doing so explicitly.

*How:* In the normal lifecycle of a framework the scheduler calls 'stop()'
on its instance of MesosSchedulerDriver when it's done using the driver.
IMPORTANT: If the framework plans to failover it must pass 'true' to
'stop()', otherwise 'false' (the default).

Some very old code (from before the introduction of the 'failover' boolean
argument) that gets invoked when a Java or Python MesosSchedulerDriver gets
garbage collected was calling 'stop()', which used the default semantics
of 'false', indicating that the framework would not be failing over and
reconnecting to Mesos. As a result, Mesos unregistered the framework.
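
To make the mechanism concrete, here is a toy Python model of it. This is
only a sketch: 'ToyDriver', 'events' and 'run_and_drop' are made-up names
for illustration, not the real Mesos bindings, and the cleanup hook is
modeled with '__del__'.

```python
import gc


class ToyDriver:
    """Hypothetical stand-in for a scheduler driver (not the Mesos binding)."""

    def __init__(self, events):
        self.events = events  # shared log, so effects outlive the object
        self.stopped = False

    def stop(self, failover=False):
        if self.stopped:
            return  # a second stop() is a no-op in this model
        self.stopped = True
        # failover=False models "unregister the framework";
        # failover=True models "disconnect but stay registered".
        self.events.append("failover" if failover else "unregistered")

    def __del__(self):
        # Models the old cleanup path: stop() with the default argument.
        self.stop()


def run_and_drop(call_stop_first):
    events = []
    driver = ToyDriver(events)
    if call_stop_first:
        driver.stop(True)
    del driver    # drop the only reference
    gc.collect()  # force a collection on non-refcounting interpreters
    return events
```

In this model, dropping the driver without calling 'stop()' records an
unregistration, while calling 'stop(True)' first turns the cleanup path
into a no-op.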

*Why:* In particular, why wasn't this bug found before? This behavior only
occurs when the MesosSchedulerDriver instance actually gets garbage
collected _AND_ 'stop()' has not already been called. Moreover, in most
applications that don't call 'stop()' the MesosSchedulerDriver does not get
garbage collected, either because a reference is maintained for the
lifetime of the application _OR_ because the application is terminated
before the garbage collector kicks in! Our best guess as to why this was
uncovered today is that, for whatever reason, the garbage collector kicked
in and 'stop()' got invoked.

*Short-term Mitigation:*

(1) Never destroy your reference to MesosSchedulerDriver (so the garbage
collector never cleans it up).
(2) Always call 'stop(true)' after you're done with the
MesosSchedulerDriver if you plan on failing over!

In addition, we'll be releasing a *0.19.1* bug fix release which fixes this
issue.

Apologies for any inconvenience this may cause folks. Big thanks to
Whitney Sorensen for reporting the bug and Vinod Kone for tracking it down.

Ben.
