Michael Park created MESOS-7487:
-----------------------------------
Summary: A framework upgrading into PARTITION_AWARE capability
will continue to receive {{TASK_LOST}} on old agents.
Key: MESOS-7487
URL: https://issues.apache.org/jira/browse/MESOS-7487
Project: Mesos
Issue Type: Bug
Components: agent
Affects Versions: 1.2.0, 1.1.0
Reporter: Michael Park
Before 1.3.0, the master did not send a {{FrameworkInfo}} in the
{{UpdateFrameworkMessage}}. In general, this means that a pre-1.3.0 agent will
not have the {{FrameworkInfo}} updated when a framework changes its
{{FrameworkInfo}}. Specifically, if a framework upgrades to have the
{{PARTITION_AWARE}} capability, the 1.1.x and 1.2.x agents will not be aware of
the update and will incorrectly report {{TASK_LOST}} in some cases.
Note that the run task path is okay since the master sends the new
{{FrameworkInfo}}. The incorrect code paths are the ones that perform the
following check against the agent's in-memory copy:
{code}
if (!protobuf::frameworkHasCapability(
        framework->info, // This is the one in agent memory!
        FrameworkInfo::Capability::PARTITION_AWARE))
{code}
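To illustrate why this check misbehaves, here is a minimal, self-contained sketch. The types and the {{wouldReportTaskLost}} helper are simplified, hypothetical stand-ins for illustration only, not the real Mesos protobufs or agent code. It only models the fact that the decision is made from whatever copy of {{FrameworkInfo}} the agent holds in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the Mesos protobuf types (assumption, not the real API).
enum class Capability { PARTITION_AWARE, MULTI_ROLE };

struct FrameworkInfo {
  std::vector<Capability> capabilities;
};

// Simplified analogue of protobuf::frameworkHasCapability.
bool frameworkHasCapability(const FrameworkInfo& info, Capability capability) {
  return std::find(info.capabilities.begin(),
                   info.capabilities.end(),
                   capability) != info.capabilities.end();
}

// Hypothetical helper: mirrors the check above. It returns true when the
// agent would fall back to TASK_LOST, based solely on the copy it is given.
bool wouldReportTaskLost(const FrameworkInfo& agentCopy) {
  return !frameworkHasCapability(agentCopy, Capability::PARTITION_AWARE);
}
```

If the agent's copy was cached before the framework added {{PARTITION_AWARE}} and is never refreshed, {{wouldReportTaskLost}} keeps returning true for that agent even though the master already holds the upgraded {{FrameworkInfo}}.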
One solution is to backport the changes to {{UpdateFrameworkMessage}} to 1.1.x
and 1.2.x, but only update the capabilities portion of the {{FrameworkInfo}}.
If we update the entire {{FrameworkInfo}}, a 1.1.x agent will run into an issue
where it doesn't know how to deal with changes to {{FrameworkInfo.roles}}, since
frameworks changing their roles is a 1.3.x feature. Note that a 1.2.x agent can
handle the role changes correctly because of {{Resource.allocation_info}}, which
was introduced with multi-role support in 1.2.x.
Refer to MESOS-7460 for the potential issue with backporting to 1.1.x.
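As a rough sketch of the capabilities-only update proposed above, the agent could copy just the capabilities from the incoming {{FrameworkInfo}} and leave every other field of its in-memory copy untouched. The struct and the {{updateCapabilitiesOnly}} function below are hypothetical simplifications for illustration, not the actual backported patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Mesos protobuf types (assumption, not the real API).
enum class Capability { PARTITION_AWARE, MULTI_ROLE };

struct FrameworkInfo {
  std::vector<std::string> roles;
  std::vector<Capability> capabilities;
};

// Hypothetical helper: refresh only the capabilities of the agent's cached
// copy. The roles are deliberately left as they were, so an agent that
// cannot handle role changes never observes them.
void updateCapabilitiesOnly(FrameworkInfo& cached, const FrameworkInfo& updated) {
  cached.capabilities = updated.capabilities;
}
```

With this shape, an update that carries both new roles and the {{PARTITION_AWARE}} capability would change only the cached capabilities, which is the behavior a 1.1.x agent would need.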