[
https://issues.apache.org/jira/browse/MESOS-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Rukletsov updated MESOS-8133:
---------------------------------------
Attachment: IgnoreOldAgentReregistration-badrun.txt
IgnoreOldAgentReregistration-goodrun.txt
> Investigate whether an agent should distinguish between registered and
> reregistered messages.
> ---------------------------------------------------------------------------------------------
>
> Key: MESOS-8133
> URL: https://issues.apache.org/jira/browse/MESOS-8133
> Project: Mesos
> Issue Type: Task
> Components: agent
> Reporter: Alexander Rukletsov
> Attachments: IgnoreOldAgentReregistration-badrun.txt,
> IgnoreOldAgentReregistration-goodrun.txt
>
>
> I have refined logs from good and bad runs of the
> {{MasterTest.IgnoreOldAgentReregistration}} test before {{MESOS-7726}} has
> been fixed. Full logs attached.
> Good run:
> {noformat}
> 19:31:28.542207 16548 slave.cpp:993] New master detected at
> [email protected]:60888
> 19:31:28.553344 16546 slave.cpp:1147] Successfully authenticated with master
> [email protected]:60888
> 19:31:28.553395 16546 slave.cpp:1626] Will retry registration in 1.49694ms if
> necessary
> 19:31:28.553436 16545 master.cpp:5801] Received register agent message from
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.553725 16545 master.cpp:5954] Registering agent at
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal) with id
> 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.554425 16546 slave.cpp:1193] Registered with master
> [email protected]:60888; given agent ID
> 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.555630 16542 slave.cpp:993] New master detected at
> [email protected]:60888
> 19:31:28.555724 16546 slave.cpp:1055] Authenticating with master
> [email protected]:60888
> 19:31:28.556710 16546 master.cpp:7966] Successfully authenticated principal
> 'test-principal' at slave(294)@172.16.10.126:60888
> 19:31:28.556766 16546 slave.cpp:1147] Successfully authenticated with master
> [email protected]:60888
> 19:31:28.556807 16546 slave.cpp:1626] Will retry registration in 9.658747ms
> if necessary
> 19:31:28.557025 16546 master.cpp:6119] Received re-register agent message
> from agent 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0 at
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.557148 16546 master.cpp:6222] Ignoring re-registration attempt from
> old agent at slave(294)@172.16.10.126:60888: agent version is 0.28.1-rc1,
> minimum supported agent version is 1.0.0
> 19:31:28.557402 16522 slave.cpp:869] Agent terminating
> {noformat}
> Bad run:
> {noformat}
> 04:31:59.541496 15900 slave.cpp:993] New master detected at
> [email protected]:42919
> 04:31:59.545332 15909 slave.cpp:1147] Successfully authenticated with master
> [email protected]:42919
> 04:31:59.545543 15909 slave.cpp:1626] Will retry registration in 1.358313ms
> if necessary
> 04:31:59.545763 15906 master.cpp:5801] Received register agent message from
> slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.546460 15904 master.cpp:5954] Registering agent at
> slave(411)@172.17.0.3:42919 (37ad05bac4a7) with id
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548152 15893 slave.cpp:1626] Will retry registration in 15.765509ms
> if necessary
> 04:31:59.548413 15912 master.cpp:6032] Registered agent
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0 at slave(411)@172.17.0.3:42919
> (37ad05bac4a7) with
> [{"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
> 04:31:59.548753 15902 slave.cpp:1193] Registered with master
> [email protected]:42919; given agent ID
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548921 15912 master.cpp:5801] Received register agent message from
> slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.549670 15902 slave.cpp:993] New master detected at
> [email protected]:42919
> 04:31:59.549906 15902 slave.cpp:1193] Registered with master
> [email protected]:42919; given agent ID
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> /mesos/src/tests/master_tests.cpp:8041: Failure
> Failed to wait 15secs for reregisterSlaveMessage
> 04:32:14.557504 15902 slave.cpp:869] Agent terminating
> {noformat}
> We can see that in the "bad" run, the agent sends {{RegisterSlaveMessage}}
> twice, likely because the reply from the master has not arrived in
> {{1.358313ms}} when the agent retries. Because the second
> {{SlaveRegisteredMessage}} arrives onto the agent during authentication and
> hence before it attempts to re-register, the {{ReregisterSlaveMessage}}
> message is never sent out. I suggest we introduce an extra {{REGISTERING}}
> state, which we enter after a (re)registration request is sent. This way we
> can drop all registered / reregistered confirmations coming from the previous
> master during the {{DISCONNECTED}} stage.
> Additionally, looking at the code, it seems that after sending a
> re-registration request in {{ReregisterSlaveMessage}} to the master, the
> agent is satisfied (given it comes from the expected master), i.e.,
> transitions from {{DISCONNECTED}} to {{RUNNING}}, with
> {{SlaveRegisteredMessage}}, losing any {{ReconcileTasksMessage}} that might
> be delivered in the coming and to be ignored {{SlaveReregisteredMessage}}.
> This seems incorrect to me, I believe the agent should ignore
> {{SlaveRegisteredMessage}} after sending {{ReregisterSlaveMessage}}.
> In production environment, if the agent receives a {{registered}} when it
> supposes to receive {{reregistered}}, can reconciliation go wrong? Also it
> seems wrong to accept a {{(re)registered}} while authentication is in
> progress?
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)