[ 
https://issues.apache.org/jira/browse/MESOS-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-8133:
---------------------------------------
    Attachment: IgnoreOldAgentReregistration-badrun.txt
                IgnoreOldAgentReregistration-goodrun.txt

> Investigate whether an agent should distinguish between registered and 
> reregistered messages.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8133
>                 URL: https://issues.apache.org/jira/browse/MESOS-8133
>             Project: Mesos
>          Issue Type: Task
>          Components: agent
>            Reporter: Alexander Rukletsov
>         Attachments: IgnoreOldAgentReregistration-badrun.txt, 
> IgnoreOldAgentReregistration-goodrun.txt
>
>
> I have refined logs from good and bad runs of the 
> {{MasterTest.IgnoreOldAgentReregistration}} test before {{MESOS-7726}} has 
> been fixed. Full logs attached.
> Good run:
> {noformat}
> 19:31:28.542207 16548 slave.cpp:993] New master detected at 
> [email protected]:60888
> 19:31:28.553344 16546 slave.cpp:1147] Successfully authenticated with master 
> [email protected]:60888
> 19:31:28.553395 16546 slave.cpp:1626] Will retry registration in 1.49694ms if 
> necessary
> 19:31:28.553436 16545 master.cpp:5801] Received register agent message from 
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.553725 16545 master.cpp:5954] Registering agent at 
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal) with id 
> 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.554425 16546 slave.cpp:1193] Registered with master 
> [email protected]:60888; given agent ID 
> 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.555630 16542 slave.cpp:993] New master detected at 
> [email protected]:60888
> 19:31:28.555724 16546 slave.cpp:1055] Authenticating with master 
> [email protected]:60888
> 19:31:28.556710 16546 master.cpp:7966] Successfully authenticated principal 
> 'test-principal' at slave(294)@172.16.10.126:60888
> 19:31:28.556766 16546 slave.cpp:1147] Successfully authenticated with master 
> [email protected]:60888
> 19:31:28.556807 16546 slave.cpp:1626] Will retry registration in 9.658747ms 
> if necessary
> 19:31:28.557025 16546 master.cpp:6119] Received re-register agent message 
> from agent 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0 at 
> slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.557148 16546 master.cpp:6222] Ignoring re-registration attempt from 
> old agent at slave(294)@172.16.10.126:60888: agent version is 0.28.1-rc1, 
> minimum supported agent version is 1.0.0
> 19:31:28.557402 16522 slave.cpp:869] Agent terminating
> {noformat}
> Bad run:
> {noformat}
> 04:31:59.541496 15900 slave.cpp:993] New master detected at 
> [email protected]:42919
> 04:31:59.545332 15909 slave.cpp:1147] Successfully authenticated with master 
> [email protected]:42919
> 04:31:59.545543 15909 slave.cpp:1626] Will retry registration in 1.358313ms 
> if necessary
> 04:31:59.545763 15906 master.cpp:5801] Received register agent message from 
> slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.546460 15904 master.cpp:5954] Registering agent at 
> slave(411)@172.17.0.3:42919 (37ad05bac4a7) with id 
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548152 15893 slave.cpp:1626] Will retry registration in 15.765509ms 
> if necessary
> 04:31:59.548413 15912 master.cpp:6032] Registered agent 
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0 at slave(411)@172.17.0.3:42919 
> (37ad05bac4a7) with 
> [{"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
> 04:31:59.548753 15902 slave.cpp:1193] Registered with master 
> [email protected]:42919; given agent ID 
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548921 15912 master.cpp:5801] Received register agent message from 
> slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.549670 15902 slave.cpp:993] New master detected at 
> [email protected]:42919
> 04:31:59.549906 15902 slave.cpp:1193] Registered with master 
> [email protected]:42919; given agent ID 
> 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> /mesos/src/tests/master_tests.cpp:8041: Failure
> Failed to wait 15secs for reregisterSlaveMessage
> 04:32:14.557504 15902 slave.cpp:869] Agent terminating
> {noformat}
> We can see that in the "bad" run, the agent sends {{RegisterSlaveMessage}} 
> twice, likely because the reply from the master has not arrived in 
> {{1.358313ms}} when the agent retries. Because the second 
> {{SlaveRegisteredMessage}} arrives onto the agent during authentication and 
> hence before it attempts to re-register, the {{ReregisterSlaveMessage}} 
> message is never sent out. I suggest we introduce an extra {{REGISTERING}} 
> state, which we enter after a (re)registration request is sent. This way we 
> can drop all registered / reregistered confirmations coming from the previous 
> master during the {{DISCONNECTED}} stage.
> Additionally, looking at the code, it seems that after sending a 
> re-registration request in {{ReregisterSlaveMessage}} to the master, the 
> agent is satisfied (given it comes from the expected master), i.e., 
> transitions from {{DISCONNECTED}} to {{RUNNING}}, with 
> {{SlaveRegisteredMessage}}, losing any {{ReconcileTasksMessage}} that might 
> be delivered in the coming and to be ignored {{SlaveReregisteredMessage}}. 
> This seems incorrect to me, I believe the agent should ignore 
> {{SlaveRegisteredMessage}} after sending {{ReregisterSlaveMessage}}.
> In production environment, if the agent receives a {{registered}} when it 
> supposes to receive {{reregistered}}, can reconciliation go wrong? Also it 
> seems wrong to accept a {{(re)registered}} while authentication is in 
> progress?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to