[ 
https://issues.apache.org/jira/browse/MESOS-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AndyPang updated MESOS-7847:
----------------------------
    Docs Text: 
master1 log:
"   6698 I0802 15:35:57.276021 20539 master.:238] 20534,_shutdown]Shutting down 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 due to health check timeout
   6699 W0802 15:35:57.276119 20539 master.:5474] 20534,shutdownSlave]Shutting 
down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.1        
24.157:5051 (10.175.124.157) with message 'health check timed out'
   6700 I0802 15:35:57.276173 20539 master.:6641] 20534,removeSlave]Removing 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:       
 5051 (10.175.124.157): health check timed out
   6701 I0802 15:35:57.276748 20542 allocat:510] 20534,removeSlave]Removed 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0
   6702 I0802 15:35:57.277498 20539 registr:468] 20534,update]Applied 1 
operations in 75998ns; attempting to update the 'registry'
   6703 I0802 15:35:57.279986 20543 registr:513] 20534,_update]Successfully 
updated the 'registry' in 2.433082ms
   6704 I0802 15:35:57.280320 20539 master.:6759] 20534,_removeSlave]Removed 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157): health check 
timed out"

master2 log:
I0802 15:37:39.663698 20747 registr:468] 20745,update]Applied 1 operations in 
34942ns; attempting to update the 'registry'
I0802 15:37:39.667167 20747 registr:513] 20745,_update]Successfully updated the 
'registry' in 3.416368ms
I0802 15:37:39.667223 20747 registr:399] 20745,__recover]Successfully recovered 
registrar
I0802 15:37:39.667393 20747 master.:1552] 20745,_recover]Recovered 0 agents 
from the Registry (143B) ; allowing 10mins for agents to re-register
...................................................................
I0802 15:38:01.305759 20750 master.:4925] 20745,reregisterSlave]Re-registering 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157)
I0802 15:38:01.306411 20750 registr:468] 20745,update]Applied 1 operations in 
63487ns; attempting to update the 'registry'
I0802 15:38:01.310890 20750 registr:513] 20745,_update]Successfully updated the 
'registry' in 4.41893ms
I0802 15:38:01.312330 20749 master.:5031] 20745,_reregisterSlave]Re-registered 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; 
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3]
I0802 15:38:01.312338 20752 allocat:478] 20745,addSlave]Added agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) with cpus(*):4; 
mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; 
core(*):[1-3] (allocated: )
I0802 15:38:01.312372 20749 master.:5099] 20745,__reregisterSlave]Sending 
updated checkpointed resources  to agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157)
I0802 15:38:01.313002 20751 master.:5161] 20745,updateSlave]Received update of 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157) with total oversubscribed resources
I0802 15:38:01.313367 20751 allocat:542] 20745,updateSlave]Agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) updated with 
oversubscribed resources  (total: cpus(*):4; mem(*):2912; disk(*):24987; 
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3], allocated: )

  was:
master1 log:
"   6698 I0802 15:35:57.276021 20539 master.:238] 20534,_shutdown]Shutting down 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 due to health check timeo        
ut
   6699 W0802 15:35:57.276119 20539 master.:5474] 20534,shutdownSlave]Shutting 
down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.1        
24.157:5051 (10.175.124.157) with message 'health check timed out'
   6700 I0802 15:35:57.276173 20539 master.:6641] 20534,removeSlave]Removing 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:       
 5051 (10.175.124.157): health check timed out
   6701 I0802 15:35:57.276748 20542 allocat:510] 20534,removeSlave]Removed 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0
   6702 I0802 15:35:57.277498 20539 registr:468] 20534,update]Applied 1 
operations in 75998ns; attempting to update the 'registry'
   6703 I0802 15:35:57.279986 20543 registr:513] 20534,_update]Successfully 
updated the 'registry' in 2.433082ms
   6704 I0802 15:35:57.280320 20539 master.:6759] 20534,_removeSlave]Removed 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157): health check 
timed out"

master2 log:
I0802 15:37:39.663698 20747 registr:468] 20745,update]Applied 1 operations in 
34942ns; attempting to update the 'registry'
I0802 15:37:39.667167 20747 registr:513] 20745,_update]Successfully updated the 
'registry' in 3.416368ms
I0802 15:37:39.667223 20747 registr:399] 20745,__recover]Successfully recovered 
registrar
I0802 15:37:39.667393 20747 master.:1552] 20745,_recover]Recovered 0 agents 
from the Registry (143B) ; allowing 10mins for agents to re-register
...................................................................
I0802 15:38:01.305759 20750 master.:4925] 20745,reregisterSlave]Re-registering 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157)
I0802 15:38:01.306411 20750 registr:468] 20745,update]Applied 1 operations in 
63487ns; attempting to update the 'registry'
I0802 15:38:01.310890 20750 registr:513] 20745,_update]Successfully updated the 
'registry' in 4.41893ms
I0802 15:38:01.312330 20749 master.:5031] 20745,_reregisterSlave]Re-registered 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987; 
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3]
I0802 15:38:01.312338 20752 allocat:478] 20745,addSlave]Added agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) with cpus(*):4; 
mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3]; 
core(*):[1-3] (allocated: )
I0802 15:38:01.312372 20749 master.:5099] 20745,__reregisterSlave]Sending 
updated checkpointed resources  to agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157)
I0802 15:38:01.313002 20751 master.:5161] 20745,updateSlave]Received update of 
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051 
(10.175.124.157) with total oversubscribed resources
I0802 15:38:01.313367 20751 allocat:542] 20745,updateSlave]Agent 
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) updated with 
oversubscribed resources  (total: cpus(*):4; mem(*):2912; disk(*):24987; 
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3], allocated: )


> Master failover, the 'slaves.recovered' struct contain unvalid slaveID when 
> slave reregistered.
> -----------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7847
>                 URL: https://issues.apache.org/jira/browse/MESOS-7847
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.0.0
>         Environment: os: ubuntu 14.04 mesos-1.0.0 version
> mesos master(2) + mesos agent(1)
>            Reporter: AndyPang
>              Labels: fail, master
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> we run two mesos-masters and one mesos-slave, in order to test mesos HA. We 
> do as follow steps:
> 1. startup master1(leader), master2(follower) and agent, the agent successful 
> registerd to master1;
> 2. shutdown agent, after 75s(the master and agent ping-pong is 15s*5) 
> shutdown master1, as a result master1 remove agent from 'registry' and send 
> 'ShutdownMessage' to agent, but agent process have terminated, so it can't 
> receive this message;
> 3. master1 is terminated and master2 is leader now, it recovered from 
> registry with 0 agent, meanwhile the agent is restarted, the agent 
> 'reregisteded' to master2 is success.
> So the issue is when master recovered from registry with 0 agent, the 
> 'slaves.recovered' strcut don't contain this Slave, it should not admit 
> reregistered success, maybe should send 'ShutdownMessage' to agent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to