[ 
https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836731#comment-17836731
 ] 

Benjamin Mahler commented on MESOS-7187:
----------------------------------------

Observed an actual instance of this. It was caused by the following sequence:

1. ZK session expired
2. Master failover
3. Agent run 1 sends re-registration message to new master with UUID 1.
4. Agent fails over (for upgrade)
5. Agent run 2 sends re-registration message to new master with UUID 2.
6. Master receives run 1 re-registration message.
7. Master ignores run 2 re-registration message (as agent is already 
re-registering).
8. Master completes re-registration and stores resource UUID 1 and notifies 
agent.
9. Agent receives re-registration completion, sends resource update with UUID 2.
10. Master *does not update* the agent's resource UUID (not because it ignores 
the update message, but because the logic simply doesn't make any update to it, 
which looks like a bug), so it remains UUID 1.

At this point, any tasks launched on the agent will go to TASK_DROPPED due to 
"Task assumes outdated resource state". The agent must be restarted to fix the 
issue.
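
To make the sequence easier to follow, here is a minimal toy model of it in 
Python. This is only a sketch, not Mesos code: ToyMaster, task_state and 
replay are hypothetical names, and the model assumes the task is dropped 
whenever the UUID recorded by the master differs from the agent's current one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Hypothetical stand-in for the master's per-agent bookkeeping."""
    # Agents with a re-registration in flight, mapped to the UUID they sent.
    in_progress: dict = field(default_factory=dict)
    # Resource UUID the master has stored for each agent.
    resource_uuid: dict = field(default_factory=dict)

    def reregister(self, agent_id: str, uuid: str) -> bool:
        if agent_id in self.in_progress:
            # Step 7: ignored, the agent is already re-registering.
            return False
        self.in_progress[agent_id] = uuid
        return True

    def complete_reregistration(self, agent_id: str) -> None:
        # Step 8: store the UUID from the message that was accepted.
        self.resource_uuid[agent_id] = self.in_progress.pop(agent_id)

    def update_resources(self, agent_id: str, uuid: str) -> None:
        # Step 10: the update is handled, but the stored UUID is never
        # touched. This models the suspected bug.
        pass


def task_state(master_uuid: str, agent_uuid: str) -> str:
    # Assumption: a UUID mismatch is what produces
    # "Task assumes outdated resource state".
    return "TASK_RUNNING" if master_uuid == agent_uuid else "TASK_DROPPED"


def replay():
    master = ToyMaster()
    run1_accepted = master.reregister("agent", "UUID 1")  # steps 3 and 6
    run2_accepted = master.reregister("agent", "UUID 2")  # steps 5 and 7
    master.complete_reregistration("agent")               # step 8
    master.update_resources("agent", "UUID 2")            # steps 9 and 10
    stored = master.resource_uuid["agent"]
    return run1_accepted, run2_accepted, stored, task_state(stored, "UUID 2")
```

Calling replay() walks through steps 3 to 10 and returns whether each 
re-registration was accepted, the UUID the master ends up with, and the 
resulting task state under the assumption above.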


> Master can neglect to update agent metadata in a re-registration corner case.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7187
>                 URL: https://issues.apache.org/jira/browse/MESOS-7187
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Major
>              Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master 
> will drop any re-registration messages that arrive while the registry 
> operation is in progress.
> These dropped messages can carry different metadata (e.g. version, 
> capabilities), which is then lost. Since the master doesn't distinguish 
> between different instances of the agent (both share the same UPID and there 
> is no instance identifying information), the master can't tell whether this 
> is a retry from the original instance of the agent or a re-registration from 
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and 
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes, new agent receives the re-registration 
> acknowledgement message and so does not retry.
> (5) Now, the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the 
> agent which remains inconsistent until a later re-registration occurs.
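
The reason the master cannot tell a retry from a new instance can be sketched 
as follows. This is an illustration under assumptions, not the actual master 
logic: handle_reregistration is a hypothetical helper, and the only key it has 
for an in-flight re-registration is the agent's UPID.

```python
def handle_reregistration(in_progress: set, known_metadata: dict,
                          upid: str, metadata: dict) -> str:
    """Hypothetical handler keyed only by UPID, as the description implies."""
    if upid in in_progress:
        # Retry from the same instance, or first message from a new one?
        # With only the UPID to go on, both look identical, so the message
        # (and whatever metadata it carries) is dropped.
        return "dropped"
    in_progress.add(upid)
    known_metadata[upid] = metadata
    return "accepted"
```

Feeding it two messages from the same UPID, the first carrying OLD_VERSION and 
the second NEW_VERSION, leaves OLD_VERSION in known_metadata, which matches 
step (5) of the example.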



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
