[ 
https://issues.apache.org/jira/browse/MESOS-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944055#comment-14944055
 ] 

James Peach commented on MESOS-2079:
------------------------------------

This can be reproduced with:
{code}
$ ./3rdparty/libprocess/tests --gtest_filter=IOTest.Write --gtest_repeat=1000
{code}

I hacked in some code that checks {{F_GETNOSIGPIPE}} and sets 
{{F_SETNOSIGPIPE}} temporarily. This makes the test reliable on OS X. I think 
that this is a better solution than depending on signal delivery details that 
tend to be quite subtle.

One thing I noticed is that once I disable {{SIGPIPE}} delivery on the file 
descriptor, we block forever in {{sigwait(2)}} in the subsequent loop that 
attempts to consume the {{SIGPIPE}}. This makes sense, since the error is now 
reported by the system call itself (as {{EPIPE}}) rather than by a signal, so 
there is nothing for {{sigwait}} to consume. I don't know the history of that 
loop, but I suspect it could be fixed by checking whether {{SIGPIPE}} is 
pending before entering the {{sigwait}}, or by using {{sigtimedwait(2)}} on 
platforms that support it. The latter won't fix OS X, though, since OS X does 
not support that system call.
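
A rough sketch of the first option, checking for a pending {{SIGPIPE}} before calling {{sigwait}}. The function name {{consumePendingSigpipe}} is made up for illustration, and the sketch assumes the calling thread already has {{SIGPIPE}} blocked, which {{sigwait}} requires anyway.

```cpp
#include <signal.h>

// Hypothetical helper: consume a SIGPIPE only if one is already pending,
// so that we never block in sigwait(2) when the error was reported via
// the system call's return value instead of the signal.
// Assumes SIGPIPE is blocked in the calling thread.
// Returns true if a pending SIGPIPE was consumed, false otherwise.
bool consumePendingSigpipe()
{
  sigset_t pending;
  sigemptyset(&pending);

  if (sigpending(&pending) != 0) {
    return false;
  }

  if (sigismember(&pending, SIGPIPE) != 1) {
    return false; // Nothing pending, so sigwait would block.
  }

  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPIPE);

  int signo = 0;
  return sigwait(&set, &signo) == 0 && signo == SIGPIPE;
}
```

Since {{sigpending(3)}} only reports signals that are blocked and pending, this returns immediately when no {{SIGPIPE}} was raised.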

[~bmahler] I can supply a patch if you can shepherd it.

> IO.Write test is flaky on OS X 10.10.
> -------------------------------------
>
>                 Key: MESOS-2079
>                 URL: https://issues.apache.org/jira/browse/MESOS-2079
>             Project: Mesos
>          Issue Type: Task
>          Components: libprocess, technical debt, test
>         Environment: OS X 10.10
> {noformat}
> $ clang++ --version
> Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
> Target: x86_64-apple-darwin14.0.0
> Thread model: posix
> {noformat}
>            Reporter: Benjamin Mahler
>              Labels: flaky
>
> [~benjaminhindman]: If I recall correctly, this is related to MESOS-1658. 
> Unfortunately, we don't have a stacktrace for SIGPIPE currently:
> {noformat}
> [ RUN      ] IO.Write
> make[5]: *** [check-local] Broken pipe: 13
> {noformat}
> Running in gdb, seems to always occur here:
> {code}
> Program received signal SIGPIPE, Broken pipe.
> [Switching to process 56827 thread 0x60b]
> 0x00007fff9a011132 in __psynch_cvwait ()
> (gdb) where
> #0  0x00007fff9a011132 in __psynch_cvwait ()
> #1  0x00007fff903e7ea0 in _pthread_cond_wait ()
> #2  0x000000010062f27c in Gate::arrive (this=0x101908a10, old=14780) at gate.hpp:82
> #3  0x0000000100600888 in process::schedule (arg=0x0) at src/process.cpp:1373
> #4  0x00007fff903e72fc in _pthread_body ()
> #5  0x00007fff903e7279 in _pthread_start ()
> #6  0x00007fff903e54b1 in thread_start ()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
