[ 
https://issues.apache.org/jira/browse/MESOS-2768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905862#comment-14905862
 ] 

Cong Wang commented on MESOS-2768:
----------------------------------

Interesting. As Yan said, the fd that got closed is loop->evpipe[1]. 
That loop's evpipe is set up by evpipe_init(), called from ev_async_start(), so 
loop->evpipe is backed by either a pipe or an eventfd. Anyway, the only place I 
can see where loop->evpipe[1] gets "closed" is:

{noformat}
      if (evpipe [1] < 0)
        evpipe [1] = fds [1]; /* first call, set write fd */
      else
        {
          /* on subsequent calls, do not change evpipe [1] */
          /* so that evpipe_write can always rely on its value. */
          /* this branch does not do anything sensible on windows, */
          /* so must not be executed on windows */

          dup2 (fds [1], evpipe [1]);
          close (fds [1]);
        }
{noformat}

So it is not exactly closed, just dup'ed: dup2() re-points the existing write fd 
at the new pipe, so the fd number stays valid. Therefore I don't think auditing 
all existing ::close() or os::close() calls in our code base would help. To me it 
looks more likely that it is a bug in libev, or a race condition, or that the fd 
gets closed implicitly somewhere in libprocess (for example by some exec(), since 
it has FD_CLOEXEC set).
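
To illustrate the dup2() point, here is a minimal standalone C sketch. It is not libev code, and the helper names (fd_is_open, retarget_write_fd, write_without_reader) are made up for this example. It assumes a POSIX system. The first helper mimics the else-branch quoted above; the last one shows how a write to a pipe with no reader fails, which is the condition that raises SIGPIPE when the signal is not ignored. It does not show what actually closed the read side in the reported crash.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Returns 1 if fd refers to an open descriptor, 0 otherwise. */
static int fd_is_open(int fd)
{
  return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/*
 * Mimics the else-branch above (hypothetical helper, not libev API):
 * create a fresh pipe and re-point the already-open write fd at it.
 * On success returns write_fd (same number as before) and stores the
 * new read end in *new_read_fd; returns -1 on error.
 */
static int retarget_write_fd(int write_fd, int *new_read_fd)
{
  int fds[2];

  if (pipe(fds) < 0)
    return -1;

  /* dup2 silently closes whatever write_fd pointed to before. */
  if (dup2(fds[1], write_fd) < 0) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  close(fds[1]); /* the duplicate in write_fd keeps the pipe open */
  *new_read_fd = fds[0];
  return write_fd;
}

/*
 * Writes one byte to a pipe whose read end has been closed.
 * SIGPIPE is ignored for the duration so the failure shows up as an
 * errno value instead of killing the process. Returns that errno,
 * or 0 if the write unexpectedly succeeded.
 */
static int write_without_reader(void)
{
  int fds[2];
  char byte = 1;
  int err = 0;
  void (*old_handler)(int) = signal(SIGPIPE, SIG_IGN);

  if (pipe(fds) < 0) {
    err = errno;
  } else {
    close(fds[0]); /* no reader left */
    if (write(fds[1], &byte, 1) < 0)
      err = errno;
    close(fds[1]);
  }

  signal(SIGPIPE, old_handler);
  return err;
}
```

After retarget_write_fd() the write fd keeps its number and stays writable, while the old pipe loses its only writer. write_without_reader() returns EPIPE, so a SIGPIPE in evpipe_write would be consistent with the read side of the pipe being gone rather than the write fd itself being closed (a write to a closed fd would fail with EBADF and raise no signal).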

> SIGPIPE in process::run_in_event_loop()
> ---------------------------------------
>
>                 Key: MESOS-2768
>                 URL: https://issues.apache.org/jira/browse/MESOS-2768
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.23.0
>            Reporter: Yan Xu
>            Priority: Critical
>
> Observed in production.
> {noformat:title=slave log}
> I0526 12:17:48.027257 51633 slave.cpp:4077] Received a new estimation of the 
> oversubscribable resources 
> W0526 12:17:48.027257 51636 logging.cpp:91] RAW: Received signal SIGPIPE; 
> escalating to SIGABRT
> *** Aborted at 1432642668 (unix time) try "date -d @1432642668" if you are 
> using GNU date ***
> PC: @     0x7fa58c23eb6d raise
> *** SIGABRT (@0xc9a5) received by PID 51621 (TID 0x7fa58224c940) from PID 
> 51621; stack trace: ***
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23eb6d raise
>     @     0x7fa58cc19ba7 mesos::internal::logging::handler()
>     @     0x7fa58c23eca0 (unknown)
>     @     0x7fa58c23da2b __libc_write
>     @     0x7fa58cb57b6f evpipe_write.part.5
>     @     0x7fa58d245070 process::run_in_event_loop<>()
>     @     0x7fa58d2441ba process::EventLoop::delay()
>     @     0x7fa58d1c3c9c process::clock::scheduleTick()
>     @     0x7fa58d1c65b1 process::Clock::timer()
>     @     0x7fa58d23915a process::delay<>()
>     @     0x7fa58d23a740 process::ReaperProcess::wait()
>     @     0x7fa58d21261a process::ProcessManager::resume()
>     @     0x7fa58d2128dc process::schedule()
>     @     0x7fa58c23683d start_thread
>     @     0x7fa58ba28fcd clone
> {noformat}
> {noformat:title=gdb}
> (gdb) bt
> #0  0x00007fa58c23eb6d in raise () from /lib64/libpthread.so.0
> #1  0x00007fa58cc19ba7 in mesos::internal::logging::handler (signal=Unhandled 
> dwarf expression opcode 0xf3
> ) at logging/logging.cpp:92
> #2  <signal handler called>
> #3  0x00007fa58c23da2b in write () from /lib64/libpthread.so.0
> #4  0x00007fa58cb57b6f in evpipe_write (loop=0x7fa58e1e79c0, flag=Unhandled 
> dwarf expression opcode 0xfa
> ) at ev.c:2172
> #5  0x00007fa58d245070 in process::run_in_event_loop<Nothing>(const 
> std::function<process::Future<Nothing>()> &) (f=Unhandled dwarf expression 
> opcode 0xf3
> ) at src/libev.hpp:80
> #6  0x00007fa58d2441ba in process::EventLoop::delay(const Duration &, const 
> std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/libev.cpp:106
> #7  0x00007fa58d1c3c9c in process::clock::scheduleTick (timers=Unhandled 
> dwarf expression opcode 0xf3
> ) at src/clock.cpp:119
> #8  0x00007fa58d1c65b1 in process::Clock::timer(const Duration &, const 
> std::function<void()> &) (duration=Unhandled dwarf expression opcode 0xf3
> ) at src/clock.cpp:254
> #9  0x00007fa58d23915a in process::delay<process::ReaperProcess> 
> (duration=..., pid=Unhandled dwarf expression opcode 0xf3
> ) at ./include/process/delay.hpp:25
> #10 0x00007fa58d23a740 in process::ReaperProcess::wait (this=0x2056920) at 
> src/reap.cpp:93
> #11 0x00007fa58d21261a in process::ProcessManager::resume (this=0x1db8d20, 
> process=0x2056958) at src/process.cpp:2172
> #12 0x00007fa58d2128dc in process::schedule (arg=Unhandled dwarf expression 
> opcode 0xf3
> ) at src/process.cpp:602
> #13 0x00007fa58c23683d in start_thread () from /lib64/libpthread.so.0
> #14 0x00007fa58ba28fcd in clone () from /lib64/libc.so.6
> {noformat}


