[ https://issues.apache.org/jira/browse/MESOS-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088947#comment-14088947 ]
Adam B commented on MESOS-1630:
-------------------------------
I can interpret a couple of problem scenarios from this JIRA's description:
1) Suppose a slave was disconnected, but still running tasks, during the
period when the framework was unregistered and then re-registered. In that
case, all the framework's tasks on other, connected slaves have already been
shut down, and one would think that we would want the old tasks on that slave
to shut down as well, since the newly re-registered framework wouldn't
necessarily know about them (and might even have new FrameworkInfo values).
2) This ticket may be describing a problem where new tasks launched by the
re-registered framework are shut down when a slave starts running a new
task and then re-registers/reconciles, since the frameworkId was added to
frameworks.registered but never removed from frameworks.completed. In this
case, I agree that the framework should not be considered "completed" and
the new task needs to keep running.
But even if we're just trying to solve scenario 2 by calling
frameworks.completed.erase in Master::addFramework or _reregisterFramework, we
still have to consider how to handle scenario 1.
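A minimal sketch of the scenario 2 fix, using simplified stand-in types rather
than the real Master code. The names Frameworks and shouldShutdownOnReconcile
are hypothetical, and addFramework here only borrows the name mentioned above,
not its actual signature. It does nothing about scenario 1.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping; the real
// Mesos Master uses richer types than these.
struct Framework {
  std::string id;
};

struct Frameworks {
  std::map<std::string, Framework> registered;
  std::vector<Framework> completed;
};

// Called when a framework (re-)registers. Erasing any completed entry with
// the same ID is the fix proposed for scenario 2.
void addFramework(Frameworks& frameworks, const Framework& framework) {
  auto& completed = frameworks.completed;
  completed.erase(
      std::remove_if(
          completed.begin(),
          completed.end(),
          [&](const Framework& f) { return f.id == framework.id; }),
      completed.end());

  frameworks.registered[framework.id] = framework;
}

// Mirrors the check described for Master::reconcile: a re-registering slave
// is told to shut down a framework only if the master still lists that
// framework as completed.
bool shouldShutdownOnReconcile(
    const Frameworks& frameworks, const std::string& id) {
  return std::any_of(
      frameworks.completed.begin(),
      frameworks.completed.end(),
      [&](const Framework& f) { return f.id == id; });
}
```

With this in place, a framework ID that sits in completed before
re-registration no longer triggers a shutdown once addFramework has run for it.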
Implementation Detail:
- When we erase the re-registering Framework from frameworks.completed and add
it to frameworks.registered, should we copy over completedExecutors,
completedTasks, or the original registeredTime/unregisteredTime? The
FrameworkInfo should probably come from the ReregisterMessage (not the old
completed framework) though.
- Or maybe we start with the old Framework struct and just update the
FrameworkInfo, pid, and reregisteredTime?
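The two options could be sketched roughly as follows. The structs are
hypothetical simplifications (the field names follow this comment, not the
real Mesos definitions), and which history fields the first option copies is
exactly the open question, so the choice shown is only an assumption.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified types; the real Mesos structs differ.
struct FrameworkInfo {
  std::string name;
  std::string user;
};

struct Framework {
  std::string id;
  FrameworkInfo info;
  std::string pid;
  double registeredTime = 0;
  double unregisteredTime = 0;
  double reregisteredTime = 0;
  std::vector<std::string> completedTasks;
  std::vector<std::string> completedExecutors;
};

// First option: build a fresh Framework from the re-registration data and
// copy history over selectively. Copying registeredTime and the completed
// lists (but not unregisteredTime) is an assumption made for illustration.
Framework freshWithHistory(
    const Framework& old,
    const FrameworkInfo& newInfo,
    const std::string& newPid,
    double now) {
  Framework f;
  f.id = old.id;
  f.info = newInfo;
  f.pid = newPid;
  f.registeredTime = old.registeredTime;
  f.reregisteredTime = now;
  f.completedTasks = old.completedTasks;
  f.completedExecutors = old.completedExecutors;
  return f;
}

// Second option: start from the old Framework and overwrite only the
// FrameworkInfo, pid, and reregisteredTime; everything else is kept.
Framework reviveFromCompleted(
    Framework old,
    const FrameworkInfo& newInfo,
    const std::string& newPid,
    double now) {
  old.info = newInfo;
  old.pid = newPid;
  old.reregisteredTime = now;
  return old;
}
```

The second option keeps unregisteredTime and any other state implicitly, which
is convenient but also means stale fields survive unless they are reset
explicitly.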
> Remove framework from completedFrameworks if framework re-registers.
> --------------------------------------------------------------------
>
> Key: MESOS-1630
> URL: https://issues.apache.org/jira/browse/MESOS-1630
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.14.0, 0.14.1, 0.14.2, 0.17.0, 0.16.0, 0.15.0, 0.18.0,
> 0.18.1, 0.18.2, 0.19.0, 0.19.1
> Reporter: Benjamin Hindman
> Assignee: Bernd Mathiske
> Priority: Critical
>
> If a framework gets removed, for example, because it unregisters with the
> master (i.e., due to MESOS-1550), but then the same framework ID is reused
> when a framework re-registers (which we currently allow), then we should
> remove the framework from Master::frameworks.completed. Otherwise, when a
> slave re-registers, Master::reconcile will notice that the slave is running
> tasks from a "completed" framework and tell the slave to shut down that
> framework, thus shutting down all of the tasks.
> This should be easily fixed by removing the framework from
> frameworks.completed when a framework re-registers with the same ID as a
> completed framework.