[
https://issues.apache.org/jira/browse/MESOS-1257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984668#comment-13984668
]
Ian Downes commented on MESOS-1257:
-----------------------------------
Not sure if this has been observed yet but I'm frequently getting sigpipes:
{noformat}
[ RUN ] ExternalContainerizerTest.Launch
I0429 19:02:47.925956 23344 leveldb.cpp:174] Opened db in 151.511471ms
[New Thread 0x7fffe63af940 (LWP 23372)]
I0429 19:02:47.946646 23344 leveldb.cpp:181] Compacted db in 20.633532ms
I0429 19:02:47.946713 23344 leveldb.cpp:196] Created db iterator in 18202ns
I0429 19:02:47.946743 23344 leveldb.cpp:202] Seeked to beginning of db in 3183ns
I0429 19:02:47.946784 23344 leveldb.cpp:271] Iterated through 0 keys in the db
in 8205ns
I0429 19:02:47.946843 23344 replica.cpp:729] Replica recovered with log
positions 0 -> 0 with 1 holes and 0 unlearned
I0429 19:02:47.948122 23351 recover.cpp:425] Starting replica recovery
I0429 19:02:47.948361 23351 recover.cpp:451] Replica is in EMPTY status
I0429 19:02:47.949110 23368 master.cpp:266] Master
20140429-190247-1828659978-51791-23344 (smfd-atr-11-sr1.devel.twitter.com)
started on 10.35.255.108:51791
I0429 19:02:47.949158 23368 master.cpp:303] Master only allowing authenticated
frameworks to register
I0429 19:02:47.949169 23368 master.cpp:308] Master only allowing authenticated
slaves to register
I0429 19:02:47.949180 23368 credentials.hpp:35] Loading credentials for
authentication
W0429 19:02:47.949239 23368 credentials.hpp:48] Failed to stat credentials file
'file:///tmp/ExternalContainerizerTest_Launch_9u7rry/credentials': No such file
or directory
I0429 19:02:47.949404 23370 replica.cpp:626] Replica in EMPTY status received a
broadcasted recover request
I0429 19:02:47.949659 23370 recover.cpp:188] Received a recover response from a
replica in EMPTY status
I0429 19:02:47.950084 23370 recover.cpp:542] Updating replica status to STARTING
I0429 19:02:47.951776 23368 master.cpp:922] The newly elected leader is
[email protected]:51791 with id 20140429-190247-1828659978-51791-23344
I0429 19:02:47.951800 23368 master.cpp:932] Elected as the leading master!
I0429 19:02:47.951810 23368 master.cpp:753] Recovering from registrar
I0429 19:02:47.951895 23368 registrar.cpp:275] Recovering registrar
I0429 19:02:48.001050 23367 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 50.878454ms
I0429 19:02:48.001092 23367 replica.cpp:320] Persisted replica status to
STARTING
I0429 19:02:48.001188 23367 recover.cpp:451] Replica is in STARTING status
I0429 19:02:48.002086 23354 replica.cpp:626] Replica in STARTING status
received a broadcasted recover request
I0429 19:02:48.002476 23356 recover.cpp:188] Received a recover response from a
replica in STARTING status
I0429 19:02:48.002647 23359 recover.cpp:542] Updating replica status to VOTING
I0429 19:02:48.051146 23354 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 47.933194ms
I0429 19:02:48.051190 23354 replica.cpp:320] Persisted replica status to VOTING
I0429 19:02:48.051234 23354 recover.cpp:556] Successfully joined the Paxos group
I0429 19:02:48.051440 23354 recover.cpp:440] Recover process terminated
I0429 19:02:48.051533 23359 log.cpp:656] Attempting to start the writer
I0429 19:02:48.052250 23360 replica.cpp:474] Replica received implicit promise
request with proposal 1
I0429 19:02:48.071349 23360 leveldb.cpp:304] Persisting metadata (8 bytes) to
leveldb took 19.070359ms
I0429 19:02:48.071385 23360 replica.cpp:342] Persisted promised to 1
I0429 19:02:48.072234 23370 coordinator.cpp:229] Coordinator attemping to fill
missing position
I0429 19:02:48.072969 23355 replica.cpp:375] Replica received explicit promise
request for position 0 with proposal 2
I0429 19:02:48.088035 23355 leveldb.cpp:341] Persisting action (8 bytes) to
leveldb took 15.027565ms
I0429 19:02:48.088073 23355 replica.cpp:664] Persisted action at 0
I0429 19:02:48.089238 23350 replica.cpp:508] Replica received write request for
position 0
I0429 19:02:48.089310 23350 leveldb.cpp:436] Reading position from leveldb took
45201ns
I0429 19:02:48.104656 23350 leveldb.cpp:341] Persisting action (14 bytes) to
leveldb took 15.319464ms
I0429 19:02:48.104692 23350 replica.cpp:664] Persisted action at 0
I0429 19:02:48.105082 23360 replica.cpp:643] Replica received learned notice
for position 0
I0429 19:02:48.112995 23360 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 7.883001ms
I0429 19:02:48.113030 23360 replica.cpp:664] Persisted action at 0
I0429 19:02:48.113045 23360 replica.cpp:649] Replica learned NOP action at
position 0
I0429 19:02:48.113327 23360 log.cpp:672] Writer started with ending position 0
I0429 19:02:48.113997 23355 leveldb.cpp:436] Reading position from leveldb took
11161ns
I0429 19:02:48.115602 23362 registrar.cpp:308] Successfully recovered registrar
I0429 19:02:48.115663 23362 registrar.cpp:379] Attempting to update the
'registry'
I0429 19:02:48.117117 23364 log.cpp:680] Attempting to append 155 bytes to the
log
I0429 19:02:48.117203 23351 coordinator.cpp:339] Coordinator attempting to
write APPEND action at position 1
I0429 19:02:48.118144 23348 replica.cpp:508] Replica received write request for
position 1
I0429 19:02:48.129664 23348 leveldb.cpp:341] Persisting action (174 bytes) to
leveldb took 11.475436ms
I0429 19:02:48.129714 23348 replica.cpp:664] Persisted action at 1
I0429 19:02:48.129935 23347 replica.cpp:643] Replica received learned notice
for position 1
I0429 19:02:48.137984 23347 leveldb.cpp:341] Persisting action (176 bytes) to
leveldb took 8.01972ms
I0429 19:02:48.138020 23347 replica.cpp:664] Persisted action at 1
I0429 19:02:48.138033 23347 replica.cpp:649] Replica learned APPEND action at
position 1
I0429 19:02:48.138408 23348 log.cpp:699] Attempting to truncate the log to 1
I0429 19:02:48.138439 23347 registrar.cpp:427] Successfully updated 'registry'
I0429 19:02:48.138521 23363 coordinator.cpp:339] Coordinator attempting to
write TRUNCATE action at position 2
I0429 19:02:48.138788 23358 master.cpp:780] Recovered 0 slaves from the
Registry (117B) ; allowing 10mins for slaves to re-register
I0429 19:02:48.139235 23366 replica.cpp:508] Replica received write request for
position 2
I0429 19:02:48.142236 23370 slave.cpp:140] Slave started on
1)@10.35.255.108:51791
I0429 19:02:48.142354 23370 slave.cpp:149] Moving slave process into its own
cgroup
I0429 19:02:48.144512 23344 sched.cpp:121] Version: 0.19.0
I0429 19:02:48.144803 23361 sched.cpp:217] New master detected at
[email protected]:51791
I0429 19:02:48.144830 23361 sched.cpp:268] Authenticating with master
[email protected]:51791
I0429 19:02:48.145000 23360 authenticatee.hpp:104] Initializing client SASL
I0429 19:02:48.159816 23366 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 20.554599ms
I0429 19:02:48.159843 23366 replica.cpp:664] Persisted action at 2
I0429 19:02:48.160336 23359 replica.cpp:643] Replica received learned notice
for position 2
I0429 19:02:48.173491 23359 leveldb.cpp:341] Persisting action (18 bytes) to
leveldb took 13.123099ms
I0429 19:02:48.173598 23359 leveldb.cpp:399] Deleting ~1 keys from leveldb took
77419ns
I0429 19:02:48.173617 23359 replica.cpp:664] Persisted action at 2
I0429 19:02:48.173627 23359 replica.cpp:649] Replica learned TRUNCATE action at
position 2
I0429 19:02:48.219429 23360 authenticatee.hpp:128] Creating new client SASL
connection
I0429 19:02:48.219591 23360 master.cpp:2795] Authenticating
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.220036 23369 authenticator.hpp:91] Initializing server SASL
I0429 19:02:48.220402 23369 authenticator.hpp:148] Creating new server SASL
connection
I0429 19:02:48.220486 23369 authenticatee.hpp:219] Received SASL authentication
mechanisms: CRAM-MD5
I0429 19:02:48.220504 23369 authenticatee.hpp:245] Attempting to authenticate
with mechanism 'CRAM-MD5'
I0429 19:02:48.220532 23369 authenticator.hpp:254] Received SASL authentication
start
I0429 19:02:48.220577 23369 authenticator.hpp:342] Authentication requires more
steps
I0429 19:02:48.220607 23369 authenticatee.hpp:265] Received SASL authentication
step
I0429 19:02:48.221308 23355 authenticator.hpp:282] Received SASL authentication
step
I0429 19:02:48.221385 23355 authenticator.hpp:334] Authentication success
I0429 19:02:48.221477 23364 authenticatee.hpp:305] Authentication success
I0429 19:02:48.221565 23364 sched.cpp:342] Successfully authenticated with
master [email protected]:51791
I0429 19:02:48.221887 23365 master.cpp:2835] Successfully authenticated
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222167 23365 master.cpp:981] Received registration request from
scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222277 23365 master.cpp:999] Registering framework
20140429-190247-1828659978-51791-23344-0000 at scheduler(1)@10.35.255.108:51791
I0429 19:02:48.222543 23365 sched.cpp:392] Framework registered with
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.222578 23358 hierarchical_allocator_process.hpp:332] Added
framework 20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.234244 23370 slave.cpp:149] Moving slave process into its own
cgroup
I0429 19:02:48.249811 23370 credentials.hpp:35] Loading credentials for
authentication
W0429 19:02:48.249902 23370 credentials.hpp:48] Failed to stat credentials file
'file:///tmp/ExternalContainerizerTest_Launch_9UOmf5/credential': No such file
or directory
I0429 19:02:48.249933 23370 slave.cpp:231] Slave using credential for:
test-principal
I0429 19:02:48.250079 23370 slave.cpp:244] Slave resources: cpus(*):2;
mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0429 19:02:48.250519 23370 slave.cpp:272] Slave hostname:
smfd-atr-11-sr1.devel.twitter.com
I0429 19:02:48.250617 23370 slave.cpp:273] Slave checkpoint: false
I0429 19:02:48.251158 23367 state.cpp:33] Recovering state from
'/tmp/ExternalContainerizerTest_Launch_9UOmf5/meta'
I0429 19:02:48.251363 23347 status_update_manager.cpp:193] Recovering status
update manager
I0429 19:02:48.251705 23365 slave.cpp:2943] Finished recovery
I0429 19:02:48.252096 23368 slave.cpp:525] New master detected at
[email protected]:51791
I0429 19:02:48.252133 23368 slave.cpp:585] Authenticating with master
[email protected]:51791
I0429 19:02:48.252181 23368 slave.cpp:558] Detecting new master
I0429 19:02:48.252148 23365 status_update_manager.cpp:167] New master detected
at [email protected]:51791
I0429 19:02:48.252198 23364 authenticatee.hpp:128] Creating new client SASL
connection
I0429 19:02:48.252482 23350 master.cpp:2795] Authenticating
slave(1)@10.35.255.108:51791
I0429 19:02:48.252596 23353 authenticator.hpp:148] Creating new server SASL
connection
I0429 19:02:48.252665 23350 authenticatee.hpp:219] Received SASL authentication
mechanisms: CRAM-MD5
I0429 19:02:48.252691 23350 authenticatee.hpp:245] Attempting to authenticate
with mechanism 'CRAM-MD5'
I0429 19:02:48.252723 23350 authenticator.hpp:254] Received SASL authentication
start
I0429 19:02:48.252773 23350 authenticator.hpp:342] Authentication requires more
steps
I0429 19:02:48.252910 23364 authenticatee.hpp:265] Received SASL authentication
step
I0429 19:02:48.253244 23370 authenticator.hpp:282] Received SASL authentication
step
I0429 19:02:48.253412 23370 authenticator.hpp:334] Authentication success
I0429 19:02:48.253463 23368 authenticatee.hpp:305] Authentication success
I0429 19:02:48.253473 23370 master.cpp:2835] Successfully authenticated
slave(1)@10.35.255.108:51791
I0429 19:02:48.254004 23352 slave.cpp:642] Successfully authenticated with
master [email protected]:51791
I0429 19:02:48.254516 23351 registrar.cpp:379] Attempting to update the
'registry'
I0429 19:02:48.255900 23359 log.cpp:680] Attempting to append 382 bytes to the
log
I0429 19:02:48.255970 23353 coordinator.cpp:339] Coordinator attempting to
write APPEND action at position 3
I0429 19:02:48.256436 23365 replica.cpp:508] Replica received write request for
position 3
I0429 19:02:48.271217 23365 leveldb.cpp:341] Persisting action (401 bytes) to
leveldb took 14.752527ms
I0429 19:02:48.271253 23365 replica.cpp:664] Persisted action at 3
I0429 19:02:48.271538 23367 replica.cpp:643] Replica received learned notice
for position 3
I0429 19:02:48.279605 23367 leveldb.cpp:341] Persisting action (403 bytes) to
leveldb took 8.029174ms
I0429 19:02:48.279642 23367 replica.cpp:664] Persisted action at 3
I0429 19:02:48.279657 23367 replica.cpp:649] Replica learned APPEND action at
position 3
I0429 19:02:48.280014 23353 registrar.cpp:427] Successfully updated 'registry'
I0429 19:02:48.280084 23352 log.cpp:699] Attempting to truncate the log to 3
I0429 19:02:48.280169 23352 coordinator.cpp:339] Coordinator attempting to
write TRUNCATE action at position 4
I0429 19:02:48.280195 23364 master.cpp:2169] Admitted slave on
smfd-atr-11-sr1.devel.twitter.com at slave(1)@10.35.255.108:51791
I0429 19:02:48.280218 23364 master.cpp:3283] Adding slave
20140429-190247-1828659978-51791-23344-0 at smfd-atr-11-sr1.devel.twitter.com
with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000]
I0429 19:02:48.280336 23348 slave.cpp:675] Registered with master
[email protected]:51791; given slave ID
20140429-190247-1828659978-51791-23344-0
I0429 19:02:48.280750 23360 hierarchical_allocator_process.hpp:445] Added slave
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
with cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] (and
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] available)
I0429 19:02:48.280879 23350 replica.cpp:508] Replica received write request for
position 4
I0429 19:02:48.281040 23360 master.cpp:2744] Sending 1 offers to framework
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.281927 23366 master.cpp:1806] Processing reply for offers: [
20140429-190247-1828659978-51791-23344-0 ] on slave
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
for framework 20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.282032 23366 master.hpp:558] Adding task 1 with resources
cpus(*):2; mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
I0429 19:02:48.282088 23366 master.cpp:2919] Launching task 1 of framework
20140429-190247-1828659978-51791-23344-0000 with resources cpus(*):2;
mem(*):1024; disk(*):1024; ports(*):[31000-32000] on slave
20140429-190247-1828659978-51791-23344-0 (smfd-atr-11-sr1.devel.twitter.com)
I0429 19:02:48.282248 23368 slave.cpp:905] Got assigned task 1 for framework
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.282518 23368 slave.cpp:1015] Launching task 1 for framework
20140429-190247-1828659978-51791-23344-0000
I0429 19:02:48.284199 23367 external_containerizer.cpp:266] Launching container
'ad9265ce-1161-42c5-b816-4976df51c2ef'
I0429 19:02:48.284247 23368 slave.cpp:1125] Queuing task '1' for executor 1 of
framework '20140429-190247-1828659978-51791-23344-0000
Detaching after fork from child process 23373.
I0429 19:02:48.289022 23350 leveldb.cpp:341] Persisting action (16 bytes) to
leveldb took 8.068729ms
I0429 19:02:48.289068 23350 replica.cpp:664] Persisted action at 4
I0429 19:02:48.289842 23360 replica.cpp:643] Replica received learned notice
for position 4
Detaching after fork from child process 23374.
I0429 19:02:48.297219 23360 leveldb.cpp:341] Persisting action (18 bytes) to
leveldb took 7.275821ms
I0429 19:02:48.297343 23360 leveldb.cpp:399] Deleting ~2 keys from leveldb took
80455ns
I0429 19:02:48.297401 23360 replica.cpp:664] Persisted action at 4
I0429 19:02:48.297437 23360 replica.cpp:649] Replica learned TRUNCATE action at
position 4
Detaching after fork from child process 23375.
Program received signal SIGPIPE, Broken pipe.
[Switching to Thread 0x7fffe8fc4940 (LWP 23367)]
0x00007ffff64cea2b in write () from /lib64/libpthread.so.0
{noformat}
{noformat}
(gdb) bt
#0 0x00007ffff64cea2b in write () from /lib64/libpthread.so.0
#1 0x00007ffff6cbeefb in write (fd=16, message=...) at
../../3rdparty/libprocess/3rdparty/stout/include/stout/os.hpp:321
#2 protobuf::write (fd=16, message=...) at
../../3rdparty/libprocess/3rdparty/stout/include/stout/protobuf.hpp:57
#3 0x00007ffff6d65a29 in
mesos::internal::slave::ExternalContainerizerProcess::invoke (this=Unhandled
dwarf expression opcode 0xf3
) at ../../src/slave/containerizer/external_containerizer.cpp:883
#4 0x00007ffff6d6ba95 in
mesos::internal::slave::ExternalContainerizerProcess::launch
(this=0x7fffe0069080, containerId=..., taskInfo=Unhandled dwarf expression
opcode 0xf3
) at ../../src/slave/containerizer/external_containerizer.cpp:313
#5 0x00007ffff6d7bb66 in operator() (__functor=Unhandled dwarf expression
opcode 0xf3
) at ../../3rdparty/libprocess/include/process/c++11/dispatch.hpp:248
#6 std::_Function_handler<void(process::ProcessBase*), process::dispatch(const
process::PID<T>&, process::Future<T> (T::*)(P0, P1, P2, P3, P4, P5, P6, P7),
A0, A1, A2, A3, A4, A5, A6, A7) [with R = Nothing; T =
mesos::internal::slave::ExternalContainerizerProcess; P0 = const
mesos::ContainerID&; P1 = const Option<mesos::TaskInfo>&; P2 = const
mesos::ExecutorInfo&; P3 = const std::basic_string<char>&; P4 = const
Option<std::basic_string<char> >&; P5 = const mesos::SlaveID&; P6 = const
process::PID<mesos::internal::slave::Slave>&; P7 = bool; A0 =
mesos::ContainerID; A1 = mesos::TaskInfo; A2 = mesos::ExecutorInfo; A3 =
std::basic_string<char>; A4 = Option<std::basic_string<char> >; A5 =
mesos::SlaveID; A6 = process::PID<mesos::internal::slave::Slave>; A7 =
bool]::__lambda28>::_M_invoke(const std::_Any_data &, process::ProcessBase *)
(__functor=Unhandled dwarf expression opcode 0xf3
) at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.1/functional:2071
#7 0x00007ffff6fc1d92 in process::ProcessManager::resume (this=0xc34680,
process=0x7fffe0069240) at ../../../3rdparty/libprocess/src/process.cpp:2623
#8 0x00007ffff6fc208c in process::schedule (arg=Unhandled dwarf expression
opcode 0xf3
) at ../../../3rdparty/libprocess/src/process.cpp:1310
#9 0x00007ffff64c783d in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff522f26d in clone () from /lib64/libc.so.6
{noformat}
{noformat}
$ gcc --version
│[1201][idownes:mesos]$
gcc (GCC) 4.8.1 20130715 (Red Hat 4.8.1-4)
│
Copyright (C) 2013 Free Software Foundation, Inc.
│
This is free software; see the source for copying conditions. There is NO
│
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
{noformat}
> ExternalContainerizerTest.Launch is flaky
> -----------------------------------------
>
> Key: MESOS-1257
> URL: https://issues.apache.org/jira/browse/MESOS-1257
> Project: Mesos
> Issue Type: Bug
> Reporter: Jie Yu
> Assignee: Till Toenshoff
>
> {noformat}
> [----------] 1 test from ExternalContainerizerTest
> [ RUN ] ExternalContainerizerTest.Launch
> 2014-04-28
> 17:04:14,345:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2014-04-28
> 17:04:17,681:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2014-04-28
> 17:04:21,018:44634(0x7f94deffd700):ZOO_ERROR@handle_socket_error_msg@1697:
> Socket [127.0.0.1:52535] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> ../../../mesos/src/tests/external_containerizer_test.cpp:202: Failure
> Failed to wait 10secs for status
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.2#6252)