Alexander Rukletsov created MESOS-8133:
------------------------------------------

             Summary: Investigate whether an agent should distinguish between 
registered and reregistered messages.
                 Key: MESOS-8133
                 URL: https://issues.apache.org/jira/browse/MESOS-8133
             Project: Mesos
          Issue Type: Task
          Components: agent
            Reporter: Alexander Rukletsov


I have refined logs from good and bad runs of the 
{{MasterTest.IgnoreOldAgentReregistration}} test before {{MESOS-7726}} has been 
fixed. Full logs attached.
Good run:
{noformat}
19:31:28.542207 16548 slave.cpp:993] New master detected at 
[email protected]:60888
19:31:28.553344 16546 slave.cpp:1147] Successfully authenticated with master 
[email protected]:60888
19:31:28.553395 16546 slave.cpp:1626] Will retry registration in 1.49694ms if 
necessary
19:31:28.553436 16545 master.cpp:5801] Received register agent message from 
slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
19:31:28.553725 16545 master.cpp:5954] Registering agent at 
slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal) with id 
43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
19:31:28.554425 16546 slave.cpp:1193] Registered with master 
[email protected]:60888; given agent ID 
43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0

19:31:28.555630 16542 slave.cpp:993] New master detected at 
[email protected]:60888
19:31:28.555724 16546 slave.cpp:1055] Authenticating with master 
[email protected]:60888
19:31:28.556710 16546 master.cpp:7966] Successfully authenticated principal 
'test-principal' at slave(294)@172.16.10.126:60888
19:31:28.556766 16546 slave.cpp:1147] Successfully authenticated with master 
[email protected]:60888
19:31:28.556807 16546 slave.cpp:1626] Will retry registration in 9.658747ms if 
necessary
19:31:28.557025 16546 master.cpp:6119] Received re-register agent message from 
agent 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0 at slave(294)@172.16.10.126:60888 
(ip-172-16-10-126.ec2.internal)
19:31:28.557148 16546 master.cpp:6222] Ignoring re-registration attempt from 
old agent at slave(294)@172.16.10.126:60888: agent version is 0.28.1-rc1, 
minimum supported agent version is 1.0.0
19:31:28.557402 16522 slave.cpp:869] Agent terminating
{noformat}
Bad run:
{noformat}
04:31:59.541496 15900 slave.cpp:993] New master detected at 
[email protected]:42919
04:31:59.545332 15909 slave.cpp:1147] Successfully authenticated with master 
[email protected]:42919
04:31:59.545543 15909 slave.cpp:1626] Will retry registration in 1.358313ms if 
necessary
04:31:59.545763 15906 master.cpp:5801] Received register agent message from 
slave(411)@172.17.0.3:42919 (37ad05bac4a7)
04:31:59.546460 15904 master.cpp:5954] Registering agent at 
slave(411)@172.17.0.3:42919 (37ad05bac4a7) with id 
1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
04:31:59.548152 15893 slave.cpp:1626] Will retry registration in 15.765509ms if 
necessary
04:31:59.548413 15912 master.cpp:6032] Registered agent 
1069c7ab-2d90-4a12-996b-ecfe9a871217-S0 at slave(411)@172.17.0.3:42919 
(37ad05bac4a7) with 
[{"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
04:31:59.548753 15902 slave.cpp:1193] Registered with master 
[email protected]:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
04:31:59.548921 15912 master.cpp:5801] Received register agent message from 
slave(411)@172.17.0.3:42919 (37ad05bac4a7)

04:31:59.549670 15902 slave.cpp:993] New master detected at 
[email protected]:42919
04:31:59.549906 15902 slave.cpp:1193] Registered with master 
[email protected]:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
/mesos/src/tests/master_tests.cpp:8041: Failure
Failed to wait 15secs for reregisterSlaveMessage
04:32:14.557504 15902 slave.cpp:869] Agent terminating
{noformat}
We can see that in the "bad" run, the agent sends {{RegisterSlaveMessage}} 
twice, likely because the reply from the master has not arrived in 
{{1.358313ms}} when the agent retries. Because the second 
{{SlaveRegisteredMessage}} arrives onto the agent during authentication and 
hence before it attempts to re-register, the {{ReregisterSlaveMessage}} message 
is never sent out. I suggest we introduce an extra {{REGISTERING}} state, which 
we enter after a (re)registration request is sent. This way we can drop all 
registered / reregistered confirmations coming from the previous master during 
the {{DISCONNECTED}} stage.

Additionally, looking at the code, it seems that after sending a 
re-registration request in {{ReregisterSlaveMessage}} to the master, the agent 
is satisfied (given it comes from the expected master), i.e., transitions from 
{{DISCONNECTED}} to {{RUNNING}}, with {{SlaveRegisteredMessage}}, losing any 
{{ReconcileTasksMessage}} that might be delivered in the coming and to be 
ignored {{SlaveReregisteredMessage}}. This seems incorrect to me, I believe the 
agent should ignore {{SlaveRegisteredMessage}} after sending 
{{ReregisterSlaveMessage}}.

In production environment, if the agent receives a {{registered}} when it 
supposes to receive {{reregistered}}, can reconciliation go wrong? Also it 
seems wrong to accept a {{(re)registered}} while authentication is in progress?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to