Vinod Kone created MESOS-3907:
---------------------------------
Summary: PersistentVolumeTest.SlaveRecovery test is flaky
Key: MESOS-3907
URL: https://issues.apache.org/jira/browse/MESOS-3907
Project: Mesos
Issue Type: Bug
Environment: ASF CI
Reporter: Vinod Kone
Looks like the executor didn't re-register in time after slave restart. The
Clock::settle() should've guaranteed that the executor sent re-register and
slave processed it, so it's not clear why this race happened.
{code}
I1112 17:49:00.663300 28118 slave.cpp:392] Slave resources: cpus(*):2;
mem(*):1024; disk(role1):1024; ports(*):[31000-32000]
I1112 17:49:00.663388 28118 slave.cpp:400] Slave attributes: [ ]
I1112 17:49:00.663406 28118 slave.cpp:405] Slave hostname: f4f9d084d7ba
I1112 17:49:00.663419 28118 slave.cpp:410] Slave checkpoint: true
I1112 17:49:00.669028 28115 state.cpp:54] Recovering state from
'/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta'
I1112 17:49:00.675704 28121 fetcher.cpp:79] Clearing fetcher cache
I1112 17:49:00.675799 28121 slave.cpp:4318] Recovering framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.675891 28121 slave.cpp:5108] Recovering executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677217 28119 status_update_manager.cpp:202] Recovering status
update manager
I1112 17:49:00.677250 28119 status_update_manager.cpp:210] Recovering executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677326 28119 status_update_manager.cpp:499] Creating
StatusUpdate stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.677896 28119 status_update_manager.cpp:802] Replaying status
update stream for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145
I1112 17:49:00.678973 28121 slave.cpp:682] Successfully attached file
'/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/executors/2b35f3fc-373c-4b5f-bbe6-dc3279117145/runs/b9dbc84e-57fa-49d5-840f-8f6d9f0233fc'
I1112 17:49:00.679440 28121 containerizer.cpp:385] Recovering containerizer
I1112 17:49:00.679527 28121 containerizer.cpp:440] Recovering container
'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' for executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.685341 28115 slave.cpp:4170] Sending reconnect request to
executor '2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865
I1112 17:49:00.690004 1529 exec.cpp:256] Received reconnect request from slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0
I1112 17:49:00.703845 28122 hierarchical.cpp:977] No resources available to
allocate!
I1112 17:49:00.703917 28122 hierarchical.cpp:1070] No inverse offers to send
out!
I1112 17:49:00.704025 28122 hierarchical.cpp:876] Performed allocation for 1
slaves in 271350ns
I1112 17:49:00.703974 28116 slave.cpp:2702] Cleaning up un-reregistered
executors
I1112 17:49:00.704107 28116 slave.cpp:2720] Killing un-reregistered executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865
I1112 17:49:00.704262 28116 slave.cpp:4230] Finished recovery
I1112 17:49:00.704397 28123 containerizer.cpp:1075] Destroying container
'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc'
I1112 17:49:00.706696 28116 slave.cpp:4387] Querying resource estimator for
oversubscribable resources
I1112 17:49:00.707142 28127 slave.cpp:4401] Received oversubscribable resources
from the resource estimator
W1112 17:49:00.707540 28127 slave.cpp:2546] Shutting down executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 because the slave is not in recovery
mode
I1112 17:49:00.708133 28116 slave.cpp:729] New master detected at
[email protected]:32961
I1112 17:49:00.708228 28116 slave.cpp:792] Authenticating with master
[email protected]:32961
I1112 17:49:00.708253 28116 slave.cpp:797] Using default CRAM-MD5 authenticatee
I1112 17:49:00.708421 28116 slave.cpp:765] Detecting new master
I1112 17:49:00.708533 28116 status_update_manager.cpp:176] Pausing sending
status updates
I1112 17:49:00.708673 28116 authenticatee.cpp:123] Creating new client SASL
connection
I1112 17:49:00.709913 28116 master.cpp:5150] Authenticating
slave(197)@172.17.6.239:32961
I1112 17:49:00.710218 28116 authenticator.cpp:415] Starting authentication
session for crammd5_authenticatee(467)@172.17.6.239:32961
I1112 17:49:00.710438 1516 exec.cpp:383] Executor asked to shutdown
I1112 17:49:00.710582 1516 exec.cpp:398] Executor::shutdown took 35481ns
I1112 17:49:00.710598 28116 authenticator.cpp:100] Creating new server SASL
connection
I1112 17:49:00.710705 1516 exec.cpp:82] Scheduling shutdown of the executor
Shutting down
Sending SIGTERM to process tree at pid 1531
I1112 17:49:00.710830 28116 authenticatee.cpp:214] Received SASL authentication
mechanisms: CRAM-MD5
I1112 17:49:00.710866 28116 authenticatee.cpp:240] Attempting to authenticate
with mechanism 'CRAM-MD5'
I1112 17:49:00.712005 28116 authenticator.cpp:205] Received SASL authentication
start
I1112 17:49:00.712069 28116 authenticator.cpp:327] Authentication requires more
steps
I1112 17:49:00.712214 28124 authenticatee.cpp:260] Received SASL authentication
step
I1112 17:49:00.712357 28116 authenticator.cpp:233] Received SASL authentication
step
I1112 17:49:00.712389 28116 auxprop.cpp:109] Request to lookup properties for
user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba'
SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false
SASL_AUXPROP_AUTHZID: false
I1112 17:49:00.712412 28116 auxprop.cpp:181] Looking up auxiliary property
'*userPassword'
I1112 17:49:00.712467 28116 auxprop.cpp:181] Looking up auxiliary property
'*cmusaslsecretCRAM-MD5'
I1112 17:49:00.712507 28116 auxprop.cpp:109] Request to lookup properties for
user: 'test-principal' realm: 'f4f9d084d7ba' server FQDN: 'f4f9d084d7ba'
SASL_AUXPROP_VERIFY_AGAINST_HASH: false SASL_AUXPROP_OVERRIDE: false
SASL_AUXPROP_AUTHZID: true
I1112 17:49:00.712524 28116 auxprop.cpp:131] Skipping auxiliary property
'*userPassword' since SASL_AUXPROP_AUTHZID == true
I1112 17:49:00.712534 28116 auxprop.cpp:131] Skipping auxiliary property
'*cmusaslsecretCRAM-MD5' since SASL_AUXPROP_AUTHZID == true
I1112 17:49:00.712555 28116 authenticator.cpp:319] Authentication success
I1112 17:49:00.712708 28128 authenticatee.cpp:300] Authentication success
I1112 17:49:00.721004 28122 master.cpp:5180] Successfully authenticated
principal 'test-principal' at slave(197)@172.17.6.239:32961
I1112 17:49:00.721352 28117 authenticator.cpp:433] Authentication session
cleanup for crammd5_authenticatee(467)@172.17.6.239:32961
Killing the following process trees:
[
--- 1531 sh -c while true; do test -d path1; done
]
I1112 17:49:00.721968 28128 slave.cpp:860] Successfully authenticated with
master [email protected]:32961
I1112 17:49:00.723456 28130 master.cpp:4019] Re-registering slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(196)@172.17.6.239:32961
(f4f9d084d7ba)
I1112 17:49:00.724056 28126 hierarchical.cpp:417] Slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 reactivated
I1112 17:49:00.723914 28130 master.cpp:4207] Sending updated checkpointed
resources disk(role1)[id1:path1]:64 to slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961
(f4f9d084d7ba)
I1112 17:49:00.724306 28128 slave.cpp:1254] Will retry registration in
17.384714ms if necessary
I1112 17:49:00.726639 28128 slave.cpp:1004] Re-registered with master
[email protected]:32961
I1112 17:49:00.726775 28122 master.cpp:3479] Telling slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961
(f4f9d084d7ba) to kill task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at
[email protected]:32961
I1112 17:49:00.726794 28127 status_update_manager.cpp:183] Resuming sending
status updates
I1112 17:49:00.726913 28128 slave.cpp:1040] Forwarding total oversubscribed
resources
I1112 17:49:00.727146 28126 master.cpp:4269] Received update of slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961
(f4f9d084d7ba) with total oversubscribed resources
I1112 17:49:00.727447 28126 hierarchical.cpp:400] Slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 (f4f9d084d7ba) updated with
oversubscribed resources (total: cpus(*):2; mem(*):1024; disk(role1):960;
ports(*):[31000-32000]; disk(role1)[id1:path1]:64, allocated: cpus(*):1;
mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64)
I1112 17:49:00.728248 28126 hierarchical.cpp:1162] Filtered offer with
cpus(*):1; mem(*):896; ports(*):[31000-32000]; disk(role1):928 on slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 for framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.728301 28126 hierarchical.cpp:977] No resources available to
allocate!
I1112 17:49:00.728350 28126 hierarchical.cpp:1070] No inverse offers to send
out!
I1112 17:49:00.728374 28126 hierarchical.cpp:892] Performed allocation for
slave 5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 in 865336ns
I1112 17:49:00.728529 28128 slave.cpp:2178] Updating framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 pid to
[email protected]:32961
I1112 17:49:00.728587 28128 slave.cpp:2194] Checkpointing framework pid
'[email protected]:32961' to
'/tmp/PersistentVolumeTest_SlaveRecovery_LLnwwj/meta/slaves/5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0/frameworks/5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000/framework.pid'
I1112 17:49:00.729202 28119 status_update_manager.cpp:183] Resuming sending
status updates
I1112 17:49:00.729709 28128 slave.cpp:2274] Updated checkpointed resources from
disk(role1)[id1:path1]:64 to disk(role1)[id1:path1]:64
I1112 17:49:00.729821 28128 slave.cpp:1821] Asked to kill task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
W1112 17:49:00.729884 28128 slave.cpp:1948] Ignoring kill task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 because the executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 at executor(1)@172.17.6.239:46865 is
terminating/terminated
I1112 17:49:00.804975 28117 containerizer.cpp:1258] Executor for container
'b9dbc84e-57fa-49d5-840f-8f6d9f0233fc' has exited
I1112 17:49:00.807514 28117 slave.cpp:3553] Executor
'2b35f3fc-373c-4b5f-bbe6-dc3279117145' of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 terminated with signal Killed
I1112 17:49:00.807701 28117 slave.cpp:2762] Handling status update TASK_LOST
(UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from @0.0.0.0:0
I1112 17:49:00.807884 28117 slave.cpp:5298] Terminating task
2b35f3fc-373c-4b5f-bbe6-dc3279117145
W1112 17:49:00.808377 28117 containerizer.cpp:967] Ignoring update for unknown
container: b9dbc84e-57fa-49d5-840f-8f6d9f0233fc
I1112 17:49:00.808811 28117 status_update_manager.cpp:322] Received status
update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.808878 28117 status_update_manager.cpp:826] Checkpointing UPDATE
for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for
task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.814139 28117 status_update_manager.cpp:376] Forwarding update
TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to the slave
I1112 17:49:00.814621 28122 slave.cpp:3087] Forwarding the update TASK_LOST
(UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 to [email protected]:32961
I1112 17:49:00.814888 28122 slave.cpp:3005] Status update manager successfully
handled status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c)
for task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.815172 28117 master.cpp:4414] Status update TASK_LOST (UUID:
0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 from slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961
(f4f9d084d7ba)
I1112 17:49:00.815218 28117 master.cpp:4462] Forwarding status update TASK_LOST
(UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.815402 28117 master.cpp:6066] Updating the state of task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (latest state: TASK_LOST, status
update state: TASK_LOST)
I1112 17:49:00.815636 28122 sched.cpp:921] Scheduler::statusUpdate took 135083ns
I1112 17:49:00.816069 28117 master.cpp:3571] Processing ACKNOWLEDGE call
0c911b27-4aa4-4435-965d-1b4b5033a40c for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 (default) at
[email protected]:32961 on slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0
I1112 17:49:00.816118 28122 hierarchical.cpp:744] Recovered cpus(*):1;
mem(*):128; disk(role1):32; disk(role1)[id1:path1]:64 (total: cpus(*):2;
mem(*):1024; disk(role1):960; ports(*):[31000-32000];
disk(role1)[id1:path1]:64, allocated: ) on slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 from framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.816128 28117 master.cpp:6132] Removing task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 with resources cpus(*):1; mem(*):128;
disk(role1):32; disk(role1)[id1:path1]:64 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000 on slave
5b272a7d-2df2-4bb6-b65f-e2633bea9888-S0 at slave(197)@172.17.6.239:32961
(f4f9d084d7ba)
I1112 17:49:00.816721 28127 status_update_manager.cpp:394] Received status
update acknowledgement (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for task
2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
I1112 17:49:00.817196 28127 status_update_manager.cpp:826] Checkpointing ACK
for status update TASK_LOST (UUID: 0c911b27-4aa4-4435-965d-1b4b5033a40c) for
task 2b35f3fc-373c-4b5f-bbe6-dc3279117145 of framework
5b272a7d-2df2-4bb6-b65f-e2633bea9888-0000
../../src/tests/persistent_volume_tests.cpp:709: Failure
Value of: status2.get().state()
Actual: TASK_LOST
Expected: TASK_KILLED
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)