[ 
https://issues.apache.org/jira/browse/MESOS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984668#comment-13984668
 ] 

Ian Downes commented on MESOS-1257:
-----------------------------------

Not sure if this has been observed yet but I'm frequently getting sigpipes:
{noformat}
[ RUN      ] ExternalContainerizerTest.Launch
I0429 19:02:47.925956 23344 leveldb.cpp:174] Opened db in 151.511471ms
[New Thread 0x7fffe63af940 (LWP 23372)]
I0429 19:02:47.946646 23344 leveldb.cpp:181] Compacted db in 20.633532ms
I0429 19:02:47.946713 23344 leveldb.cpp:196] Created db iterator in 18202ns
I0429 19:02:47.946743 23344 leveldb.cpp:202] Seeked to beginning of db in 3183ns
I0429 19:02:47.946784 23344 leveldb.cpp:271] Iterated through 0 keys in the db 
in 8205ns
I0429 19:02:47.946843 23344 replica.cpp:729] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0429 19:02:47.948122 23351 recover.cpp:425] Starting replica recovery
I0429 19:02:47.948361 23351 recover.cpp:451] Replica is in EMPTY status
I0429 19:02:47.949110 23368 master.cpp:266] Master 
20140429-190247-1828659978-51791-23344 (smfd-atr-11-sr1.devel.twitter.com) 
started on 10.35.255.108:51791
I0429 19:02:47.949158 23368 master.cpp:303] Master only allowing authenticated 
frameworks to register
I0429 19:02:47.949169 23368 master.cpp:308] Master only allowing authenticated 
slaves to register
I0429 19:02:47.949180 23368 credentials.hpp:35] Loading credentials for 
authentication
W0429 19:02:47.949239 23368 credentials.hpp:48] Failed to stat credentials file 
'file:///tmp/ExternalContainerizerTest_Launch_9u7rry/credentials': No such file 
or directory
I0429 19:02:47.949404 23370 replica.cpp:626] Replica in EMPTY status received a 
broadcasted recover request
I0429 19:02:47.949659 23370 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0429 19:02:47.950084 23370 recover.cpp:542] Updating replica status to STARTING
I0429 19:02:47.951776 23368 master.cpp:922] The newly elected leader is 
[email protected]:51791 with id 20140429-190247-1828659978-51791-23344
I0429 19:02:47.951800 23368 master.cpp:932] Elected as the leading master!
I0429 19:02:47.951810 23368 master.cpp:753] Recovering from registrar
I0429 19:02:47.951895 23368 registrar.cpp:275] Recovering registrar
I0429 19:02:48.001050 23367 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 50.878454ms
I0429 19:02:48.001092 23367 replica.cpp:320] Persisted replica status to 
STARTING
I0429 19:02:48.001188 23367 recover.cpp:451] Replica is in STARTING status
I0429 19:02:48.002086 23354 replica.cpp:626] Replica in STARTING status 
received a broadcasted recover request
I0429 19:02:48.002476 23356 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0429 19:02:48.002647 23359 recover.cpp:542] Updating replica status to VOTING
I0429 19:02:48.051146 23354 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 47.933194ms
I0429 19:02:48.051190 23354 replica.cpp:320] Persisted replica status to VOTING
I0429 19:02:48.051234 23354 recover.cpp:556] Successfully joined the Paxos group
I0429 19:02:48.051440 23354 recover.cpp:440] Recover process terminated
I0429 19:02:48.051533 23359 log.cpp:656] Attempting to start the writer
I0429 19:02:48.052250 23360 replica.cpp:474] Replica received implicit promise 
request with proposal 1
I0429 19:02:48.071349 23360 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 19.070359ms
I0429 19:02:48.071385 23360 replica.cpp:342] Persisted promised to 1
I0429 19:02:48.072234 23370 coordinator.cpp:229] Coordinator attemping to fill 
missing position
I0429 19:02:48.072969 23355 replica.cpp:375] Replica received explicit promise 
request for position 0 with proposal 2
I0429 19:02:48.088035 23355 leveldb.cpp:341] Persisting action (8 bytes) to 
leveldb took 15.027565ms
I0429 19:02:48.088073 23355 replica.cpp:664] Persisted action at 0
I0429 19:02:48.089238 23350 replica.cpp:508] Replica received write request for 
position 0
I0429 19:02:48.089310 23350 leveldb.cpp:436] Reading position from leveldb took 
45201ns
I0429 19:02:48.104656 23350 leveldb.cpp:341] Persisting action (14 bytes) to 
leveldb took 15.319464ms
I0429 19:02:48.104692 23350 replica.cpp:664] Persisted action at 0
I0429 19:02:48.105082 23360 replica.cpp:643] Replica received learned notice 
for position 0
I0429 19:02:48.112995 23360 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 7.883001ms
I0429 19:02:48.113030 23360 replica.cpp:664] Persisted action at 0
I0429 19:02:48.113045 23360 replica.cpp:649] Replica learned NOP action at 
position 0
I0429 19:02:48.113327 23360 log.cpp:672] Writer started with ending position 0
I0429 19:02:48.113997 23355 leveldb.cpp:436] Reading position from leveldb took 
11161ns
I0429 19:02:48.115602 23362 registrar.cpp:308] Successfully recovered registrar
I0429 19:02:48.115663 23362 registrar.cpp:379] Attempting to update the 
'registry'
I0429 19:02:48.117117 23364 log.cpp:680] Attempting to append 155 bytes to the 
log
I0429 19:02:48.117203 23351 coordinator.cpp:339] Coordinator attempting to 
write APPEND action at position 1
I0429 19:02:48.118144 23348 replica.cpp:508] Replica received write request for 
position 1
I0429 19:02:48.129664 23348 leveldb.cpp:341] Persisting action (174 bytes) to 
leveldb took 11.475436ms
I0429 19:02:48.129714 23348 replica.cpp:664] Persisted action at 1
I0429 19:02:48.129935 23347 replica.cpp:643] Replica received learned notice 
for position 1
I0429 19:02:48.137984 23347 leveldb.cpp:341] Persisting action (176 bytes) to 
leveldb took 8.01972ms
I0429 19:02:48.138020 23347 replica.cpp:664] Persisted action at 1
I0429 19:02:48.138033 23347 replica.cpp:649] Replica learned APPEND action at 
position 1
I0429 19:02:48.138408 23348 log.cpp:699] Attempting to truncate the log to 1
I0429 19:02:48.138439 23347 registrar.cpp:427] Successfully updated 'registry'
I0429 19:02:48.138521 23363 coordinator.cpp:339] Coordinator attempting to 
write TRUNCATE action at position 2
I0429 19:02:48.138788 23358 master.cpp:780] Recovered 0 slaves from the 
Registry (117B) ; allowing 10mins for slaves to re-register
I0429 19:02:48.139235 23366 replica.cpp:508] Replica received write request for 
position 2
I0429 19:02:48.142236 23370 slave.cpp:140] Slave started on 
1)@10.35.255.108:51791
I0429 19:02:48.142354 23370 slave.cpp:149] Moving slave process into its own 
cgroup
I0429 19:02:48.144512 23344 sched.cpp:121] Version: 0.19.0
I0429 19:02:48.144803 23361 sched.cpp:217] New master detected at 
[email protected]:51791
I0429 19:02:48.144830 23361 sched.cpp:268] Authenticating with master 
[email protected]:51791
I0429 19:02:48.145000 23360 authenticatee.hpp:104] Initializing client SASL
I0429 19:02:48.159816 23366 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 20.554599ms
I0429 19:02:48.159843 23366 replica.cpp:664] Persisted action at 2
I0429 19:02:48.160336 23359 replica.cpp:643] Replica received learned notice 
for position 2
I0429 19:02:48.173491 23359 leveldb.cpp:341] Persisting action (18 bytes) to 
leveldb took 13.123099ms
I0429 19:02:48.173598 23359 leveldb.cpp:399] Deleting ~1 keys from leveldb took 
77419ns
I0429 19:02:48.173617 23359 replica.cpp:664] Persisted action at 2
I0429 19:02:48.173627 23359 replica.cpp:649] Replica learned TRUNCATE action at 
position 2
I0429 19:02:48.219429 23360 authenticatee.hpp:128] Creating new client SASL 
connection
I0429 19:02:48.219591 23360 master.cpp:2795] Authenticating 
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.220036 23369 authenticator.hpp:91] Initializing server SASL
I0429 19:02:48.220402 23369 authenticator.hpp:148] Creating new server SASL 
connection
I0429 19:02:48.220486 23369 authenticatee.hpp:219] Received SASL authentication 
mechanisms: CRAM-MD5
I0429 19:02:48.220504 23369 authenticatee.hpp:245] Attempting to authenticate 
with mechanism 'CRAM-MD5'
I0429 19:02:48.220532 23369 authenticator.hpp:254] Received SASL authentication 
start
I0429 19:02:48.220577 23369 authenticator.hpp:342] Authentication requires more 
steps
I0429 19:02:48.220607 23369 authenticatee.hpp:265] Received SASL authentication 
step
I0429 19:02:48.221308 23355 authenticator.hpp:282] Received SASL authentication 
step
I0429 19:02:48.221385 23355 authenticator.hpp:334] Authentication success
I0429 19:02:48.221477 23364 authenticatee.hpp:305] Authentication success
I0429 19:02:48.221565 23364 sched.cpp:342] Successfully authenticated with 
master [email protected]:51791
I0429 19:02:48.221887 23365 master.cpp:2835] Successfully authenticated 
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222167 23365 master.cpp:981] Received registration request from 
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222277 23365 master.cpp:999] Registering framework 
20140429-190247-1828659978-51791-23344-0000 at scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222543 23365 sched.cpp:392] Framework registered with 
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.222578 23358 hierarchical_allocator_process.hpp:332] Added 
framework 20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.234244 23370 slave.cpp:149] Moving slave process into its own 
cgroup
I0429 19:02:48.249811 23370 credentials.hpp:35] Loading credentials for 
authentication
W0429 19:02:48.249902 23370 credentials.hpp:48] Failed to stat credentials file 
'file:///tmp/ExternalContainerizerTest_Launch_9UOmf5/credential': No such file 
or directory
I0429 19:02:48.249933 23370 slave.cpp:231] Slave using credential for: 
test-principal
I0429 19:02:48.250079 23370 slave.cpp:244] Slave resources: cpus(*):2; 
mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0429 19:02:48.250519 23370 slave.cpp:272] Slave hostname: 
smfd-atr-11-sr1.devel.twitter.com
I0429 19:02:48.250617 23370 slave.cpp:273] Slave checkpoint: false
I0429 19:02:48.251158 23367 state.cpp:33] Recovering state from 
'/tmp/ExternalContainerizerTest_Launch_9UOmf5/meta'
I0429 19:02:48.251363 23347 status_update_manager.cpp:193] Recovering status 
update manager
I0429 19:02:48.251705 23365 slave.cpp:2943] Finished recovery
I0429 19:02:48.252096 23368 slave.cpp:525] New master detected at 
[email protected]:51791
I0429 19:02:48.252133 23368 slave.cpp:585] Authenticating with master 
[email protected]:51791
I0429 19:02:48.252181 23368 slave.cpp:558] Detecting new master
I0429 19:02:48.252148 23365 status_update_manager.cpp:167] New master detected 
at [email protected]:51791
I0429 19:02:48.252198 23364 authenticatee.hpp:128] Creating new client SASL 
connection
I0429 19:02:48.252482 23350 master.cpp:2795] Authenticating 
slave(1)@10.35.255.108:51791
I0429 19:02:48.252596 23353 authenticator.hpp:148] Creating new server SASL 
connection
I0429 19:02:48.252665 23350 authenticatee.hpp:219] Received SASL authentication 
mechanisms: CRAM-MD5
I0429 19:02:48.252691 23350 authenticatee.hpp:245] Attempting to authenticate 
with mechanism 'CRAM-MD5'
I0429 19:02:48.252723 23350 authenticator.hpp:254] Received SASL authentication 
start
I0429 19:02:48.252773 23350 authenticator.hpp:342] Authentication requires more 
steps
I0429 19:02:48.252910 23364 authenticatee.hpp:265] Received SASL authentication 
step
I0429 19:02:48.253244 23370 authenticator.hpp:282] Received SASL authentication 
step
I0429 19:02:48.253412 23370 authenticator.hpp:334] Authentication success
I0429 19:02:48.253463 23368 authenticatee.hpp:305] Authentication success
I0429 19:02:48.253473 23370 master.cpp:2835] Successfully authenticated 
slave(1)@10.35.255.108:51791
I0429 19:02:48.254004 23352 slave.cpp:642] Successfully authenticated with 
master [email protected]:51791
I0429 19:02:48.254516 23351 registrar.cpp:379] Attempting to update the 
'registry'
I0429 19:02:48.255900 23359 log.cpp:680] Attempting to append 382 bytes to the 
log
I0429 19:02:48.255970 23353 coordinator.cpp:339] Coordinator attempting to 
write APPEND action at position 3
I0429 19:02:48.256436 23365 replica.cpp:508] Replica received write request for 
position 3
I0429 19:02:48.271217 23365 leveldb.cpp:341] Persisting action (401 bytes) to 
leveldb took 14.752527ms
I0429 19:02:48.271253 23365 replica.cpp:664] Persisted action at 3
I0429 19:02:48.271538 23367 replica.cpp:643] Replica received learned notice 
for position 3
I0429 19:02:48.279605 23367 leveldb.cpp:341] Persisting action (403 bytes) to 
leveldb took 8.029174ms
I0429 19:02:48.279642 23367 replica.cpp:664] Persisted action at 3
I0429 19:02:48.279657 23367 replica.cpp:649] Replica learned APPEND action at 
position 3
I0429 19:02:48.280014 23353 registrar.cpp:427] Successfully updated 'registry'
I0429 19:02:48.280084 23352 log.cpp:699] Attempting to truncate the log to 3
I0429 19:02:48.280169 23352 coordinator.cpp:339] Coordinator attempting to 
write TRUNCATE action at position 4
I0429 19:02:48.280195 23364 master.cpp:2169] Admitted slave on 
smfd-atr-11-sr1.devel.twitter.com at slave(1)@10.35.255.108:51791
I0429 19:02:48.280218 23364 master.cpp:3283] Adding slave 
20140429-190247-1828659978-51791-23344-0 at smfd-atr-11-sr1.devel.twitter.com 
with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0429 19:02:48.280336 23348 slave.cpp:675] Registered with master 
[email protected]:51791; given slave ID 
20140429-190247-1828659978-51791-23344-0
I0429 19:02:48.280750 23360 hierarchical_allocator_process.hpp:445] Added slave 
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com) 
with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and 
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
I0429 19:02:48.280879 23350 replica.cpp:508] Replica received write request for 
position 4
I0429 19:02:48.281040 23360 master.cpp:2744] Sending 1 offers to framework 
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.281927 23366 master.cpp:1806] Processing reply for offers: [ 
20140429-190247-1828659978-51791-23344-0 ] on slave 
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com) 
for framework 20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.282032 23366 master.hpp:558] Adding task 1 with resources 
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
I0429 19:02:48.282088 23366 master.cpp:2919] Launching task 1 of framework 
20140429-190247-1828659978-51791-23344-0000 with resources cpus(*):2; 
mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave 
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
I0429 19:02:48.282248 23368 slave.cpp:905] Got assigned task 1 for framework 
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.282518 23368 slave.cpp:1015] Launching task 1 for framework 
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.284199 23367 external_containerizer.cpp:266] Launching container 
'ad9265ce-1161-42c5-b816-4976df51c2ef'
I0429 19:02:48.284247 23368 slave.cpp:1125] Queuing task '1' for executor 1 of 
framework '20140429-190247-1828659978-51791-23344-0000
Detaching after fork from child process 23373.
I0429 19:02:48.289022 23350 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 8.068729ms
I0429 19:02:48.289068 23350 replica.cpp:664] Persisted action at 4
I0429 19:02:48.289842 23360 replica.cpp:643] Replica received learned notice 
for position 4
Detaching after fork from child process 23374.
I0429 19:02:48.297219 23360 leveldb.cpp:341] Persisting action (18 bytes) to 
leveldb took 7.275821ms
I0429 19:02:48.297343 23360 leveldb.cpp:399] Deleting ~2 keys from leveldb took 
80455ns
I0429 19:02:48.297401 23360 replica.cpp:664] Persisted action at 4
I0429 19:02:48.297437 23360 replica.cpp:649] Replica learned TRUNCATE action at 
position 4
Detaching after fork from child process 23375.

Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffe8fc4940 (LWP 23367)]
0x00007ffff64cea2b in write () from /lib64/libpthread.so.0
{noformat}

{noformat}
(gdb) bt
#0  0x00007ffff64cea2b in write () from /lib64/libpthread.so.0
#1  0x00007ffff6cbeefb in write (fd=16, message=...) at 
../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:321
#2  protobuf::write (fd=16, message=...) at 
../../3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp:57
#3  0x00007ffff6d65a29 in 
mesos::internal::slave::ExternalContainerizerProcess::invoke (this=Unhandled 
dwarf expression opcode 0xf3
) at ../../src/slave/containerizer/external_containerizer.cpp:883
#4  0x00007ffff6d6ba95 in 
mesos::internal::slave::ExternalContainerizerProcess::launch 
(this=0x7fffe0069080, containerId=..., taskInfo=Unhandled dwarf expression 
opcode 0xf3
) at ../../src/slave/containerizer/external_containerizer.cpp:313
#5  0x00007ffff6d7bb66 in operator() (__functor=Unhandled dwarf expression 
opcode 0xf3
) at ../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:248
#6  std::_Function_handler<void(process::ProcessBase*), process::dispatch(const 
process::PID<T>&, process::Future<T> (T::*)(P0, P1, P2, P3, P4, P5, P6, P7), 
A0, A1, A2, A3, A4, A5, A6, A7) [with R = Nothing; T = 
mesos::internal::slave::ExternalContainerizerProcess; P0 = const 
mesos::ContainerID&; P1 = const Option<mesos::TaskInfo>&; P2 = const 
mesos::ExecutorInfo&; P3 = const std::basic_string<char>&; P4 = const 
Option<std::basic_string<char> >&; P5 = const mesos::SlaveID&; P6 = const 
process::PID<mesos::internal::slave::Slave>&; P7 = bool; A0 = 
mesos::ContainerID; A1 = mesos::TaskInfo; A2 = mesos::ExecutorInfo; A3 = 
std::basic_string<char>; A4 = Option<std::basic_string<char> >; A5 = 
mesos::SlaveID; A6 = process::PID<mesos::internal::slave::Slave>; A7 = 
bool]::__lambda28>::_M_invoke(const std::_Any_data &, process::ProcessBase *) 
(__functor=Unhandled dwarf expression opcode 0xf3
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.1/functional:2071
#7  0x00007ffff6fc1d92 in process::ProcessManager::resume (this=0xc34680, 
process=0x7fffe0069240) at ../../../3rdparty/libprocess/src/process.cpp:2623
#8  0x00007ffff6fc208c in process::schedule (arg=Unhandled dwarf expression 
opcode 0xf3
) at ../../../3rdparty/libprocess/src/process.cpp:1310
#9  0x00007ffff64c783d in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff522f26d in clone () from /lib64/libc.so.6
{noformat}

{noformat}
$ gcc --version                                                                 
                                                                       
│[1201][idownes:mesos]$
gcc (GCC) 4.8.1 20130715 (Red Hat 4.8.1-4)                                      
                                                                                
                      │
Copyright (C) 2013 Free Software Foundation, Inc.                               
                                                                                
                      │
This is free software; see the source for copying conditions.  There is NO      
                                                                                
                      │
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
{noformat}


> ExternalContainerizerTest.Launch is flaky
> -----------------------------------------
>
>                 Key: MESOS-1257
>                 URL: https://issues.apache.org/jira/browse/MESOS-1257
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jie Yu
>            Assignee: Till Toenshoff
>
> {noformat}
> [----------] 1 test from ExternalContainerizerTest
> [ RUN      ] ExternalContainerizerTest.Launch
> 2014-04-28 
> 17:04:14,345:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> 2014-04-28 
> 17:04:17,681:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> 2014-04-28 
> 17:04:21,018:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697: 
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server 
> refused to accept the client
> ../../../mesos/src/tests/external_containerizer_test.cpp:202: Failure
> Failed to wait 10secs for status
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to