[ https://issues.apache.org/jira/browse/MESOS-7187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17836731#comment-17836731 ]
Benjamin Mahler commented on MESOS-7187:
----------------------------------------

Observed an actual instance of this. It occurred due to the following sequence of events:

1. ZK session expired
2. Master failover
3. Agent run 1 sends re-registration message to new master with UUID 1.
4. Agent fails over (for upgrade)
5. Agent run 2 sends re-registration message to new master
6. Master receives run 1 re-registration message.
7. Master ignores run 2 re-registration message (as agent is already re-registering).
8. Master completes re-registration and stores resource UUID 1 and notifies agent.
9. Agent receives re-registration completion, sends resource update with UUID 2.
10. Master *does not update* the agent's resource UUID (not because it ignores the update message, but because the logic simply doesn't make any update to it, which looks like a bug), so it remains UUID 1.

At this point, any tasks launched on the agent will go to TASK_DROPPED due to "Task assumes outdated resource state". The agent must be restarted to fix the issue.

> Master can neglect to update agent metadata in a re-registration corner case.
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-7187
>                 URL: https://issues.apache.org/jira/browse/MESOS-7187
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Priority: Major
>              Labels: tech-debt
>
> If the agent is re-registering with the master for the first time, the master
> will drop any re-registration messages that arrive while the registry
> operation is in progress.
> These dropped messages can carry different metadata (e.g. version,
> capabilities), which is then lost. Since the master doesn't distinguish
> between different instances of the agent (both share the same UPID and there
> is no instance identifying information), the master can't tell whether this
> is a retry from the original instance of the agent or a re-registration from
> a new instance of the agent.
> The following is an example:
> (1) Master restarts.
> (2) Agent re-registers with OLD_VERSION / OLD_CAPABILITIES.
> (3) While registry operation is in progress, agent is upgraded and
> re-registers with NEW_VERSION / NEW_CAPABILITIES.
> (4) Registry operation completes, new agent receives the re-registration
> acknowledgement message and so, does not retry.
> (5) Now, the master's memory reflects OLD_VERSION / OLD_CAPABILITIES for the
> agent which remains inconsistent until a later re-registration occurs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
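
For readers who want to step through the ten-step sequence in the comment, here is a toy Python model of it. This is only a sketch under stated assumptions: `ToyMaster`, `launch_task`, `simulate`, and the `update_uuid_on_resource_update` flag are hypothetical names invented for illustration and are not Mesos APIs. The model assumes the run 2 re-registration carries UUID 2, and it does not try to reflect where in Mesos the "outdated resource state" comparison actually happens. The flag shows one conceivable remedy, not the actual fix.

```python
# Toy model of the re-registration race. Every name here is hypothetical
# and for illustration only; none of this is Mesos code.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyMaster:
    # When True, a resource update also refreshes the stored resource UUID.
    # This is an assumed remedy, not the actual Mesos patch.
    update_uuid_on_resource_update: bool = False
    # agent id -> resource UUID of the re-registration currently in progress
    reregistering: Dict[str, str] = field(default_factory=dict)
    # agent id -> resource UUID the master believes is current
    registered: Dict[str, str] = field(default_factory=dict)

    def reregister(self, agent_id: str, resource_uuid: str) -> bool:
        """Return False when the message is ignored (already re-registering)."""
        if agent_id in self.reregistering:
            return False  # step 7: the message from agent run 2 is ignored
        self.reregistering[agent_id] = resource_uuid
        return True

    def complete_registry_operation(self, agent_id: str) -> None:
        # Step 8: the UUID from the first accepted message is stored.
        self.registered[agent_id] = self.reregistering.pop(agent_id)

    def resource_update(self, agent_id: str, resource_uuid: str) -> None:
        # Step 10: the update is processed, but the stored UUID is left
        # untouched unless the hypothetical remedy is enabled.
        if self.update_uuid_on_resource_update:
            self.registered[agent_id] = resource_uuid


def launch_task(master: ToyMaster, agent_id: str, agent_uuid: str) -> str:
    # A task is dropped when the master's view of the resource UUID differs
    # from the agent's current one ("Task assumes outdated resource state").
    if master.registered[agent_id] != agent_uuid:
        return "TASK_DROPPED"
    return "LAUNCHED"


def simulate(fix: bool) -> str:
    master = ToyMaster(update_uuid_on_resource_update=fix)
    master.reregister("agent", "UUID-1")         # run 1 is accepted (step 6)
    master.reregister("agent", "UUID-2")         # run 2 is ignored (step 7)
    master.complete_registry_operation("agent")  # step 8
    master.resource_update("agent", "UUID-2")    # steps 9 and 10
    return launch_task(master, "agent", agent_uuid="UUID-2")


if __name__ == "__main__":
    print("without remedy:", simulate(fix=False))
    print("with remedy:   ", simulate(fix=True))
```

In this model the stale UUID persists until something overwrites it, which matches the observation that the agent must be restarted (triggering another re-registration) to recover.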