[
https://issues.apache.org/jira/browse/MESOS-7847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
AndyPang updated MESOS-7847:
----------------------------
Docs Text:
master1 log:
"I0802 15:35:57.276021 20539 master.:238] 20534,_shutdown]Shutting down
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 due to health check timeout
W0802 15:35:57.276119 20539 master.:5474] 20534,shutdownSlave]Shutting
down agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at
slave(1)@10.175.124.157:5051 (10.175.124.157) with message 'health check
timed out'
I0802 15:35:57.276173 20539 master.:6641] 20534,removeSlave]Removing
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at
slave(1)@10.175.124.157:5051 (10.175.124.157): health check timed out
I0802 15:35:57.276748 20542 allocat:510] 20534,removeSlave]Removed
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0
I0802 15:35:57.277498 20539 registr:468] 20534,update]Applied 1
operations in 75998ns; attempting to update the 'registry'
I0802 15:35:57.279986 20543 registr:513] 20534,_update]Successfully
updated the 'registry' in 2.433082ms
I0802 15:35:57.280320 20539 master.:6759] 20534,_removeSlave]Removed
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157): health check
timed out"
master2 log:
I0802 15:37:39.663698 20747 registr:468] 20745,update]Applied 1 operations in
34942ns; attempting to update the 'registry'
I0802 15:37:39.667167 20747 registr:513] 20745,_update]Successfully updated the
'registry' in 3.416368ms
I0802 15:37:39.667223 20747 registr:399] 20745,__recover]Successfully recovered
registrar
I0802 15:37:39.667393 20747 master.:1552] 20745,_recover]Recovered 0 agents
from the Registry (143B) ; allowing 10mins for agents to re-register
...................................................................
I0802 15:38:01.305759 20750 master.:4925] 20745,reregisterSlave]Re-registering
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051
(10.175.124.157)
I0802 15:38:01.306411 20750 registr:468] 20745,update]Applied 1 operations in
63487ns; attempting to update the 'registry'
I0802 15:38:01.310890 20750 registr:513] 20745,_update]Successfully updated the
'registry' in 4.41893ms
I0802 15:38:01.312330 20749 master.:5031] 20745,_reregisterSlave]Re-registered
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051
(10.175.124.157) with cpus(*):4; mem(*):2912; disk(*):24987;
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3]
I0802 15:38:01.312338 20752 allocat:478] 20745,addSlave]Added agent
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) with cpus(*):4;
mem(*):2912; disk(*):24987; ports(*):[31000-32000]; cpu_set(*):[0-3];
core(*):[1-3] (allocated: )
I0802 15:38:01.312372 20749 master.:5099] 20745,__reregisterSlave]Sending
updated checkpointed resources to agent
8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051
(10.175.124.157)
I0802 15:38:01.313002 20751 master.:5161] 20745,updateSlave]Received update of
agent 8c1bff32-a25e-4c46-b79a-353e05754174-S0 at slave(1)@10.175.124.157:5051
(10.175.124.157) with total oversubscribed resources
I0802 15:38:01.313367 20751 allocat:542] 20745,updateSlave]Agent
8c1bff32-a25e-4c46-b79a-353e05754174-S0 (10.175.124.157) updated with
oversubscribed resources (total: cpus(*):4; mem(*):2912; disk(*):24987;
ports(*):[31000-32000]; cpu_set(*):[0-3]; core(*):[1-3], allocated: )
> Master failover: the 'slaves.recovered' struct contains an invalid slaveID
> when the slave re-registers.
> -----------------------------------------------------------------------------------------------
>
> Key: MESOS-7847
> URL: https://issues.apache.org/jira/browse/MESOS-7847
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.0.0
> Environment: os: ubuntu 14.04 mesos-1.0.0 version
> mesos master(2) + mesos agent(1)
> Reporter: AndyPang
> Labels: fail, master
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> We ran two mesos-masters and one mesos-slave to test Mesos HA, following
> these steps:
> 1. Start master1 (leader), master2 (follower) and the agent; the agent
> successfully registers with master1.
> 2. Shut down the agent, then shut down master1 after 75s (the master-agent
> ping-pong timeout is 15s*5). As a result, master1 removes the agent from the
> 'registry' and sends a 'ShutdownMessage' to the agent, but the agent process
> has already terminated, so it cannot receive this message.
> 3. master1 is terminated and master2 is now the leader; it recovers from the
> registry with 0 agents. Meanwhile the agent is restarted, and its
> re-registration with master2 succeeds.
> The issue: when the master recovers from the registry with 0 agents, the
> 'slaves.recovered' struct does not contain this slave, so the master should
> not accept the re-registration; perhaps it should send a 'ShutdownMessage'
> to the agent instead.
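
The behaviour the reporter asks for can be sketched as a small standalone C++ model. This is only an illustration of the proposal, not Mesos source code: `Decision`, `RecoveredAgents`, `decideReregistration`, `healthCheckTimeoutSeconds` and `replayReportedScenario` are hypothetical names introduced here, and the set of strings merely stands in for whatever 'slaves.recovered' holds in the real master. The only values taken from the report are the agent ID, the "0 agents recovered" state, and the 15s*5 timeout.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

// Hypothetical outcome of a re-registration attempt; not a Mesos type.
enum class Decision { Readmit, Shutdown };

// Hypothetical stand-in for the agent IDs a newly elected master recovered
// from the registry (the report refers to this as 'slaves.recovered').
using RecoveredAgents = std::unordered_set<std::string>;

// Health check timeout as described in the report: ping interval multiplied
// by the number of missed pings that are tolerated.
constexpr int healthCheckTimeoutSeconds(int pingIntervalSeconds,
                                        int maxMissedPings)
{
  return pingIntervalSeconds * maxMissedPings;
}

// The report quotes 15s * 5 = 75s.
static_assert(healthCheckTimeoutSeconds(15, 5) == 75,
              "timeout quoted in the report");

// Policy proposed by the reporter (an assumption, not current behaviour):
// an agent that is absent from the recovered set is told to shut down
// instead of being re-admitted.
Decision decideReregistration(const RecoveredAgents& recovered,
                              const std::string& agentId)
{
  return recovered.count(agentId) > 0 ? Decision::Readmit
                                      : Decision::Shutdown;
}

// Replays the reported scenario: master2 recovers 0 agents, then the
// previously removed agent attempts to re-register.
void replayReportedScenario()
{
  const RecoveredAgents recovered;  // "Recovered 0 agents from the Registry"
  const std::string agentId = "8c1bff32-a25e-4c46-b79a-353e05754174-S0";
  assert(decideReregistration(recovered, agentId) == Decision::Shutdown);
}
```

Under this model, the re-registration logged by master2 at 15:38:01 would have produced a shutdown decision rather than "Re-registered agent". Whether rejecting unknown agents is the right policy for every deployment is a design question the report leaves open.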