[ 
https://issues.apache.org/jira/browse/MESOS-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952496#comment-14952496
 ] 

Benjamin Mahler edited comment on MESOS-2768 at 10/12/15 7:17 PM:
------------------------------------------------------------------

Ok, so having ruled out double closes as the culprit, I spent some time digging 
into libev with [~chzhcn] late last week and found the bug. [~jieyu] helped me 
validate this by injecting sleeps into libev to be able to trigger the bug 
deterministically from the tests. Thanks guys!

Note that this issue manifests on older versions of Linux where the eventfd 
headers are not available, which is the case for CentOS 5. I sent an email to 
the libev mailing list, where the bug was confirmed: 
http://lists.schmorp.de/pipermail/libev/2015q4/thread.html

It seems there are a couple of options:

*(1)* Wait for the libev release that includes the fix; this may take some time.

*(2)* Update our patch file to include the fix. This can be done quickly as a 
stop-gap but will not help those who use an unbundled libev.

*(3)* Update libprocess to ignore SIGPIPE temporarily when using ev_async_send. 
This seems undesirable because it is a hot path and because it introduces yet 
another block that temporarily ignores SIGPIPE.

*(4)* Update libprocess to ignore SIGPIPE process-wide and document this so 
that users of libprocess understand that EPIPE must be handled. In retrospect 
this seems like the right long-term decision, since we've had to inject several 
SIGPIPE-ignoring blocks and OS X still has quirks. Not to mention that SIGPIPE 
is unnecessary and is meant primarily for shell filter-like programs.
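
As a rough illustration of option (4) (a minimal sketch, not libprocess's 
actual code; the helper names ignoreSigpipeProcessWide and 
errnoOfWriteToBrokenPipe are made up here), ignoring SIGPIPE process-wide turns 
a write to a pipe with no reader into an EPIPE error that the caller has to 
handle:

```cpp
#include <cerrno>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper (not a libprocess API): ignore SIGPIPE for the whole
// process. Returns true if the disposition was installed.
bool ignoreSigpipeProcessWide()
{
  struct sigaction action = {};
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  action.sa_handler = SIG_IGN;
  return sigaction(SIGPIPE, &action, nullptr) == 0;
}

// Hypothetical helper: writes one byte to a pipe whose read end has already
// been closed and returns the errno of that write (0 if it succeeded).
// With SIGPIPE ignored, the write fails with EPIPE instead of raising a
// signal that terminates the process.
int errnoOfWriteToBrokenPipe()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return errno;
  }
  close(fds[0]); // No reader remains.

  char byte = 0;
  int error = 0;
  if (write(fds[1], &byte, 1) < 0) {
    error = errno; // Save before close() can overwrite it.
  }
  close(fds[1]);
  return error;
}
```

The idea would be to call something like ignoreSigpipeProcessWide() once during 
initialization; every write to a pipe or socket then needs to check for EPIPE 
explicitly.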

The fix consists of these three diffs; I tested locally that they fix the issue:
http://cvs.schmorp.de/libev/ev.c?r1=1.477&r2=1.478
http://cvs.schmorp.de/libev/ev_epoll.c?r1=1.68&r2=1.69
http://cvs.schmorp.de/libev/ev_epoll.c?r1=1.69&r2=1.70



> SIGPIPE in process::run_in_event_loop()
> ---------------------------------------
>
>                 Key: MESOS-2768
>                 URL: https://issues.apache.org/jira/browse/MESOS-2768
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 0.23.0
>         Environment: CentOS 5
>            Reporter: Yan Xu
>            Priority: Critical
>
> Observed in production.
> {noformat:title=slave log}
> I0526 12:17:48.027257 51633 slave.cpp:4077] Received a new estimation of the oversubscribable resources 
> W0526 12:17:48.027257 51636 logging.cpp:91] RAW: Received signal SIGPIPE; escalating to SIGABRT
> *** Aborted at 1432642668 (unix time) try "date -d @1432642668" if you are using GNU date ***
> PC: @     0x7fa58c23eb6d raise
> *** SIGABRT (@0xc9a5) received by PID 51621 (TID 0x7fa58224c940) from PID 51621; stack trace: ***
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23eb6d raise
>     @     0x7fa58cc19ba7 mesos::internal::logging::handler()
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23da2b __libc_write
>     @     0x7fa58cb57b6f evpipe_write.part.5
>     @     0x7fa58d245070 process::run_in_event_loop<>()
>     @     0x7fa58d2441ba process::EventLoop::delay()
>     @     0x7fa58d1c3c9c process::clock::scheduleTick()
>     @     0x7fa58d1c65b1 process::Clock::timer()
>     @     0x7fa58d23915a process::delay<>()
>     @     0x7fa58d23a740 process::ReaperProcess::wait()
>     @     0x7fa58d21261a process::ProcessManager::resume()
>     @     0x7fa58d2128dc process::schedule()
>     @     0x7fa58c23683d start_thread
>     @     0x7fa58ba28fcd clone
> {noformat}
> {noformat:title=gdb}
> (gdb) bt
> #0  0x00007fa58c23eb6d in raise () from /lib64/libpthread.so.0
> #1  0x00007fa58cc19ba7 in mesos::internal::logging::handler (signal=Unhandled 
> dwarf expression opcode 0xf3
> ) at logging/logging.cpp:92
> #2  <signal handler called>
> #3  0x00007fa58c23da2b in write () from /lib64/libpthread.so.0
> #4  0x00007fa58cb57b6f in evpipe_write (loop=0x7fa58e1e79c0, flag=Unhandled 
> dwarf expression opcode 0xfa
> ) at ev.c:2172
> #5  0x00007fa58d245070 in process::run_in_event_loop<Nothing>(const 
> std::function<process::Future<Nothing>()> &) (f=Unhandled dwarf expression 
> opcode 0xf3
> ) at src/libev.hpp:80
> #6  0x00007fa58d2441ba in process::EventLoop::delay(const Duration &, const 
> std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/libev.cpp:106
> #7  0x00007fa58d1c3c9c in process::clock::scheduleTick (timers=Unhandled 
> dwarf expression opcode 0xf3
> ) at src/clock.cpp:119
> #8  0x00007fa58d1c65b1 in process::Clock::timer(const Duration &, const 
> std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/clock.cpp:254
> #9  0x00007fa58d23915a in process::delay<process::ReaperProcess> 
> (duration=..., pid=Unhandled dwarf expression opcode 0xf3
> ) at ./include/process/delay.hpp:25
> #10 0x00007fa58d23a740 in process::ReaperProcess::wait (this=0x2056920) at 
> src/reap.cpp:93
> #11 0x00007fa58d21261a in process::ProcessManager::resume (this=0x1db8d20, 
> process=0x2056958) at src/process.cpp:2172
> #12 0x00007fa58d2128dc in process::schedule (arg=Unhandled dwarf expression 
> opcode 0xf3
> ) at src/process.cpp:602
> #13 0x00007fa58c23683d in start_thread () from /lib64/libpthread.so.0
> #14 0x00007fa58ba28fcd in clone () from /lib64/libc.so.6
> {noformat}


