Benjamin Mahler created MESOS-1331:
--------------------------------------
Summary: SlaveRecoveryTest/0.NonCheckpointingFramework is flaky.
Key: MESOS-1331
URL: https://issues.apache.org/jira/browse/MESOS-1331
Project: Mesos
Issue Type: Bug
Components: test
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Fix For: 0.19.0
{noformat}
RUN ] SlaveRecoveryTest/0.NonCheckpointingFramework
Using temporary directory
'/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_aP60b8'
I0508 20:05:15.992506 37575 leveldb.cpp:174] Opened db in 171.124155ms
I0508 20:05:16.017724 37575 leveldb.cpp:181] Compacted db in 25118us
I0508 20:05:16.017792 37575 leveldb.cpp:196] Created db iterator in 5810ns
I0508 20:05:16.017822 37575 leveldb.cpp:202] Seeked to beginning of db in 1290ns
I0508 20:05:16.017853 37575 leveldb.cpp:271] Iterated through 0 keys in the db
in 587ns
I0508 20:05:16.017899 37575 replica.cpp:741] Replica recovered with log
positions 0 -> 0 with 1 holes and 0 unlearned
I0508 20:05:16.018878 37595 recover.cpp:425] Starting replica recovery
I0508 20:05:16.019269 37592 recover.cpp:451] Replica is in EMPTY status
I0508 20:05:16.020742 37593 replica.cpp:638] Replica in EMPTY status received a
broadcasted recover request
I0508 20:05:16.021250 37595 master.cpp:267] Master
20140508-200516-1740121354-60182-37575 (smfd-bkq-03-sr4.devel.twitter.com)
started on 10.37.184.103:60182
I0508 20:05:16.021291 37595 master.cpp:304] Master only allowing authenticated
frameworks to register
I0508 20:05:16.021306 37595 master.cpp:309] Master only allowing authenticated
slaves to register
I0508 20:05:16.021318 37595 credentials.hpp:35] Loading credentials for
authentication
I0508 20:05:16.021481 37612 recover.cpp:188] Received a recover response from a
replica in EMPTY status
W0508 20:05:16.021528 37595 credentials.hpp:48] Failed to stat credentials file
'file:///tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_aP60b8/credentials':
No such file or directory
I0508 20:05:16.022090 37603 recover.cpp:542] Updating replica status to STARTING
I0508 20:05:16.023128 37606 master.cpp:919] The newly elected leader is
[email protected]:60182 with id 20140508-200516-1740121354-60182-37575
I0508 20:05:16.023175 37606 master.cpp:929] Elected as the leading master!
I0508 20:05:16.023207 37606 master.cpp:750] Recovering from registrar
I0508 20:05:16.023367 37590 registrar.cpp:313] Recovering registrar
I0508 20:05:16.059332 37611 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 36.962934ms
I0508 20:05:16.059419 37611 replica.cpp:320] Persisted replica status to
STARTING
I0508 20:05:16.059635 37611 recover.cpp:451] Replica is in STARTING status
I0508 20:05:16.061924 37598 replica.cpp:638] Replica in STARTING status
received a broadcasted recover request
I0508 20:05:16.062656 37595 recover.cpp:188] Received a recover response from a
replica in STARTING status
I0508 20:05:16.063076 37611 recover.cpp:542] Updating replica status to VOTING
I0508 20:05:16.081363 37592 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 18.158871ms
I0508 20:05:16.081449 37592 replica.cpp:320] Persisted replica status to VOTING
I0508 20:05:16.081564 37592 recover.cpp:556] Successfully joined the Paxos group
I0508 20:05:16.081925 37592 recover.cpp:440] Recover process terminated
I0508 20:05:16.082630 37601 log.cpp:656] Attempting to start the writer
I0508 20:05:16.084640 37601 replica.cpp:474] Replica received implicit promise
request with proposal 1
I0508 20:05:16.089722 37601 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 5.046965ms
I0508 20:05:16.089864 37601 replica.cpp:342] Persisted promised to 1
I0508 20:05:16.090682 37598 coordinator.cpp:230] Coordinator attemping to fill
missing position
I0508 20:05:16.092527 37597 replica.cpp:375] Replica received explicit promise
request for position 0 with proposal 2
I0508 20:05:16.097997 37597 leveldb.cpp:341] Persisting action (8 bytes) to
leveldb took 5.317699ms
I0508 20:05:16.098073 37597 replica.cpp:676] Persisted action at 0
I0508 20:05:16.099889 37591 replica.cpp:508] Replica received write request for
position 0
I0508 20:05:16.099982 37591 leveldb.cpp:436] Reading position from leveldb took
41686ns
I0508 20:05:16.106329 37591 leveldb.cpp:341] Persisting action (14 bytes) to
leveldb took 6.286379ms
I0508 20:05:16.106410 37591 replica.cpp:676] Persisted action at 0
I0508 20:05:16.107161 37592 replica.cpp:655] Replica received learned notice
for position 0
I0508 20:05:16.114692 37592 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 7.478084ms
I0508 20:05:16.114835 37592 replica.cpp:676] Persisted action at 0
I0508 20:05:16.114939 37592 replica.cpp:661] Replica learned NOP action at
position 0
I0508 20:05:16.115711 37604 log.cpp:672] Writer started with ending position 0
I0508 20:05:16.117002 37599 leveldb.cpp:436] Reading position from leveldb took
18438ns
I0508 20:05:16.120064 37606 registrar.cpp:346] Successfully fetched the
registry (0B)
I0508 20:05:16.120125 37606 registrar.cpp:422] Attempting to update the
'registry'
I0508 20:05:16.122580 37608 log.cpp:680] Attempting to append 155 bytes to the
log
I0508 20:05:16.122792 37607 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 1
I0508 20:05:16.123879 37605 replica.cpp:508] Replica received write request for
position 1
I0508 20:05:16.131319 37605 leveldb.cpp:341] Persisting action (174 bytes) to
leveldb took 7.391658ms
I0508 20:05:16.131402 37605 replica.cpp:676] Persisted action at 1
I0508 20:05:16.132503 37595 replica.cpp:655] Replica received learned notice
for position 1
I0508 20:05:16.139683 37595 leveldb.cpp:341] Persisting action (176 bytes) to
leveldb took 7.108266ms
I0508 20:05:16.139770 37595 replica.cpp:676] Persisted action at 1
I0508 20:05:16.139932 37595 replica.cpp:661] Replica learned APPEND action at
position 1
I0508 20:05:16.140727 37598 registrar.cpp:479] Successfully updated 'registry'
I0508 20:05:16.140985 37598 registrar.cpp:372] Successfully recovered registrar
I0508 20:05:16.141175 37591 log.cpp:699] Attempting to truncate the log to 1
I0508 20:05:16.141296 37601 master.cpp:777] Recovered 0 slaves from the
Registry (117B) ; allowing 10mins for slaves to re-register
I0508 20:05:16.141649 37610 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 2
I0508 20:05:16.143034 37609 replica.cpp:508] Replica received write request for
position 2
I0508 20:05:16.144410 37575 mesos_containerizer.cpp:122] Using isolation:
cgroups/cpu,cgroups/mem
I0508 20:05:16.148017 37609 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 4.943311ms
I0508 20:05:16.148066 37609 replica.cpp:676] Persisted action at 2
I0508 20:05:16.148726 37589 replica.cpp:655] Replica received learned notice
for position 2
I0508 20:05:16.156313 37589 leveldb.cpp:341] Persisting action (18 bytes) to
leveldb took 7.453827ms
I0508 20:05:16.156415 37589 leveldb.cpp:399] Deleting ~1 keys from leveldb took
24402ns
I0508 20:05:16.156452 37589 replica.cpp:676] Persisted action at 2
I0508 20:05:16.156484 37589 replica.cpp:661] Replica learned TRUNCATE action at
position 2
I0508 20:05:16.168103 37575 linux_launcher.cpp:66] Using /sys/fs/cgroup/freezer
as the freezer hierarchy for the Linux launcher
I0508 20:05:16.171707 37609 slave.cpp:140] Slave started on
6)@10.37.184.103:60182
I0508 20:05:16.171931 37609 slave.cpp:149] Moving slave process into its own
cgroup
I0508 20:05:16.174898 37575 sched.cpp:121] Version: 0.19.0
I0508 20:05:16.175509 37607 sched.cpp:217] New master detected at
[email protected]:60182
I0508 20:05:16.175560 37607 sched.cpp:268] Authenticating with master
[email protected]:60182
I0508 20:05:16.175889 37606 authenticatee.hpp:128] Creating new client SASL
connection
I0508 20:05:16.176017 37606 master.cpp:2796] Authenticating
scheduler(6)@10.37.184.103:60182
I0508 20:05:16.176208 37599 authenticator.hpp:148] Creating new server SASL
connection
I0508 20:05:16.176324 37599 authenticatee.hpp:219] Received SASL authentication
mechanisms: CRAM-MD5
I0508 20:05:16.176353 37599 authenticatee.hpp:245] Attempting to authenticate
with mechanism 'CRAM-MD5'
I0508 20:05:16.176414 37599 authenticator.hpp:254] Received SASL authentication
start
I0508 20:05:16.176493 37599 authenticator.hpp:342] Authentication requires more
steps
I0508 20:05:16.176558 37599 authenticatee.hpp:265] Received SASL authentication
step
I0508 20:05:16.176681 37603 authenticator.hpp:282] Received SASL authentication
step
I0508 20:05:16.176782 37603 authenticator.hpp:334] Authentication success
I0508 20:05:16.176908 37603 authenticatee.hpp:305] Authentication success
I0508 20:05:16.176935 37601 master.cpp:2836] Successfully authenticated
scheduler(6)@10.37.184.103:60182
I0508 20:05:16.177193 37603 sched.cpp:342] Successfully authenticated with
master [email protected]:60182
I0508 20:05:16.177511 37604 master.cpp:978] Received registration request from
scheduler(6)@10.37.184.103:60182
I0508 20:05:16.177633 37604 master.cpp:996] Registering framework
20140508-200516-1740121354-60182-37575-0000 at scheduler(6)@10.37.184.103:60182
I0508 20:05:16.177810 37604 sched.cpp:392] Framework registered with
20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.177891 37595 hierarchical_allocator_process.hpp:331] Added
framework 20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.180985 37609 slave.cpp:149] Moving slave process into its own
cgroup
I0508 20:05:16.186306 37609 credentials.hpp:35] Loading credentials for
authentication
W0508 20:05:16.186379 37609 credentials.hpp:48] Failed to stat credentials file
'file:///tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/credential':
No such file or directory
I0508 20:05:16.186441 37609 slave.cpp:230] Slave using credential for:
test-principal
I0508 20:05:16.186610 37609 slave.cpp:243] Slave resources: cpus(*):2;
mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0508 20:05:16.187018 37609 slave.cpp:271] Slave hostname:
smfd-bkq-03-sr4.devel.twitter.com
I0508 20:05:16.187150 37609 slave.cpp:272] Slave checkpoint: true
I0508 20:05:16.187824 37597 state.cpp:33] Recovering state from
'/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/meta'
I0508 20:05:16.188226 37612 status_update_manager.cpp:193] Recovering status
update manager
I0508 20:05:16.188405 37612 mesos_containerizer.cpp:279] Recovering
containerizer
I0508 20:05:16.190035 37594 mem.cpp:165] Removing orphaned cgroup
'mesos_test_39687eb8-5053-4094-846f-d737be84f6e4/slave'
I0508 20:05:16.190281 37608 cpushare.cpp:189] Removing orphaned cgroup
'cpuacct/mesos_test_39687eb8-5053-4094-846f-d737be84f6e4/slave'
I0508 20:05:16.192466 37599 slave.cpp:2943] Finished recovery
I0508 20:05:16.192909 37599 slave.cpp:524] New master detected at
[email protected]:60182
I0508 20:05:16.192977 37599 slave.cpp:584] Authenticating with master
[email protected]:60182
I0508 20:05:16.193020 37598 status_update_manager.cpp:167] New master detected
at [email protected]:60182
I0508 20:05:16.193830 37599 slave.cpp:557] Detecting new master
I0508 20:05:16.193846 37589 authenticatee.hpp:128] Creating new client SASL
connection
I0508 20:05:16.194098 37592 master.cpp:2796] Authenticating
slave(6)@10.37.184.103:60182
I0508 20:05:16.194519 37602 authenticator.hpp:148] Creating new server SASL
connection
I0508 20:05:16.194674 37590 authenticatee.hpp:219] Received SASL authentication
mechanisms: CRAM-MD5
I0508 20:05:16.194768 37590 authenticatee.hpp:245] Attempting to authenticate
with mechanism 'CRAM-MD5'
I0508 20:05:16.195200 37607 authenticator.hpp:254] Received SASL authentication
start
I0508 20:05:16.195261 37607 authenticator.hpp:342] Authentication requires more
steps
I0508 20:05:16.195386 37606 authenticatee.hpp:265] Received SASL authentication
step
I0508 20:05:16.195545 37604 authenticator.hpp:282] Received SASL authentication
step
I0508 20:05:16.195613 37604 authenticator.hpp:334] Authentication success
I0508 20:05:16.195688 37606 authenticatee.hpp:305] Authentication success
I0508 20:05:16.195700 37604 master.cpp:2836] Successfully authenticated
slave(6)@10.37.184.103:60182
I0508 20:05:16.196056 37606 slave.cpp:641] Successfully authenticated with
master [email protected]:60182
I0508 20:05:16.196316 37594 master.cpp:2132] Registering slave at
slave(6)@10.37.184.103:60182 (smfd-bkq-03-sr4.devel.twitter.com) with id
20140508-200516-1740121354-60182-37575-0
I0508 20:05:16.196648 37603 registrar.cpp:422] Attempting to update the
'registry'
I0508 20:05:16.199156 37607 log.cpp:680] Attempting to append 382 bytes to the
log
I0508 20:05:16.199429 37603 coordinator.cpp:340] Coordinator attempting to
write APPEND action at position 3
I0508 20:05:16.200638 37602 replica.cpp:508] Replica received write request for
position 3
I0508 20:05:16.222914 37602 leveldb.cpp:341] Persisting action (401 bytes) to
leveldb took 22.226643ms
I0508 20:05:16.222998 37602 replica.cpp:676] Persisted action at 3
I0508 20:05:16.223551 37605 replica.cpp:655] Replica received learned notice
for position 3
I0508 20:05:16.231291 37605 leveldb.cpp:341] Persisting action (403 bytes) to
leveldb took 7.699023ms
I0508 20:05:16.231374 37605 replica.cpp:676] Persisted action at 3
I0508 20:05:16.231400 37605 replica.cpp:661] Replica learned APPEND action at
position 3
I0508 20:05:16.232504 37589 registrar.cpp:479] Successfully updated 'registry'
I0508 20:05:16.232722 37594 log.cpp:699] Attempting to truncate the log to 3
I0508 20:05:16.232893 37591 coordinator.cpp:340] Coordinator attempting to
write TRUNCATE action at position 4
I0508 20:05:16.233067 37610 master.cpp:2172] Registered slave
20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182
(smfd-bkq-03-sr4.devel.twitter.com)
I0508 20:05:16.233126 37610 master.cpp:3281] Adding slave
20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182
(smfd-bkq-03-sr4.devel.twitter.com) with cpus(*):2; mem(*):1024; disk(*):1024;
ports(*):[31000-32000]
I0508 20:05:16.233392 37598 slave.cpp:674] Registered with master
[email protected]:60182; given slave ID
20140508-200516-1740121354-60182-37575-0
I0508 20:05:16.233686 37598 slave.cpp:687] Checkpointing SlaveInfo to
'/tmp/SlaveRecoveryTest_0_NonCheckpointingFramework_TXCwTS/meta/slaves/20140508-200516-1740121354-60182-37575-0/slave.info'
I0508 20:05:16.233767 37600 hierarchical_allocator_process.hpp:444] Added slave
20140508-200516-1740121354-60182-37575-0 (smfd-bkq-03-sr4.devel.twitter.com)
with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
I0508 20:05:16.233808 37605 replica.cpp:508] Replica received write request for
position 4
I0508 20:05:16.234395 37595 master.cpp:2745] Sending 1 offers to framework
20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.237540 37602 master.cpp:1810] Processing reply for offers: [
20140508-200516-1740121354-60182-37575-0 ] on slave
20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182
(smfd-bkq-03-sr4.devel.twitter.com) for framework
20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.237859 37602 master.hpp:584] Adding task
7d10204d-a2f6-4b6c-b4f0-922de0923728 with resources cpus(*):2; mem(*):1024;
disk(*):1024; ports(*):[31000-32000] on slave
20140508-200516-1740121354-60182-37575-0 (smfd-bkq-03-sr4.devel.twitter.com)
I0508 20:05:16.237992 37602 master.cpp:2920] Launching task
7d10204d-a2f6-4b6c-b4f0-922de0923728 of framework
20140508-200516-1740121354-60182-37575-0000 with resources cpus(*):2;
mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave
20140508-200516-1740121354-60182-37575-0 at slave(6)@10.37.184.103:60182
(smfd-bkq-03-sr4.devel.twitter.com)
I0508 20:05:16.238281 37612 slave.cpp:904] Got assigned task
7d10204d-a2f6-4b6c-b4f0-922de0923728 for framework
20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.238638 37612 slave.cpp:1014] Launching task
7d10204d-a2f6-4b6c-b4f0-922de0923728 for framework
20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.239637 37605 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 5.773305ms
I0508 20:05:16.239725 37605 replica.cpp:676] Persisted action at 4
I0508 20:05:16.240603 37594 replica.cpp:655] Replica received learned notice
for position 4
I0508 20:05:16.241828 37592 mesos_containerizer.cpp:523] Starting container
'22acc835-edb6-4771-b9d6-0a11ca542aee' for executor
'7d10204d-a2f6-4b6c-b4f0-922de0923728' of framework
'20140508-200516-1740121354-60182-37575-0000'
I0508 20:05:16.241981 37612 slave.cpp:1124] Queuing task
'7d10204d-a2f6-4b6c-b4f0-922de0923728' for executor
7d10204d-a2f6-4b6c-b4f0-922de0923728 of framework
'20140508-200516-1740121354-60182-37575-0000
I0508 20:05:16.244410 37601 mem.cpp:413] Started listening for OOM events for
container 22acc835-edb6-4771-b9d6-0a11ca542aee
I0508 20:05:16.245115 37601 mem.cpp:277] Updated 'memory.soft_limit_in_bytes'
to 1GB for container 22acc835-edb6-4771-b9d6-0a11ca542aee
I0508 20:05:16.245321 37612 cpushare.cpp:334] Updated 'cpu.shares' to 2048
(cpus 2) for container 22acc835-edb6-4771-b9d6-0a11ca542aee
I0508 20:05:16.246103 37601 mem.cpp:307] Updated 'memory.limit_in_bytes' to 1GB
for container 22acc835-edb6-4771-b9d6-0a11ca542aee
I0508 20:05:16.247835 37591 linux_launcher.cpp:222] Cloning child process with
flags = 0
I0508 20:05:16.247961 37594 leveldb.cpp:341] Persisting action (18 bytes) to
leveldb took 7.292704ms
I0508 20:05:16.248569 37594 leveldb.cpp:399] Deleting ~2 keys from leveldb took
132796ns
I0508 20:05:16.248649 37594 replica.cpp:676] Persisted action at 4
I0508 20:05:16.248749 37594 replica.cpp:661] Replica learned TRUNCATE action at
position 4
I0508 20:05:16.427090 37598 mesos_containerizer.cpp:623] Fetching URIs for
container '22acc835-edb6-4771-b9d6-0a11ca542aee' using command
'/home/bmahler/git/mesos/build/src/mesos-fetcher'
I0508 20:05:16.761432 37594 slave.cpp:2281] Monitoring executor
'7d10204d-a2f6-4b6c-b4f0-922de0923728' of framework
'20140508-200516-1740121354-60182-37575-0000' in container
'22acc835-edb6-4771-b9d6-0a11ca542aee'
../../src/tests/slave_recovery_tests.cpp:1167: Failure
Failed to wait 10secs for update
{noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)