[ https://issues.apache.org/jira/browse/MESOS-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944055#comment-14944055 ]
James Peach commented on MESOS-2079:
------------------------------------
This can be reproduced with:
{code}
$ ./3rdparty/libprocess/tests --gtest_filter=IOTest.Write --gtest_repeat=1000
{code}
I hacked in some code that checks the descriptor with {{fcntl(2)}}
{{F_GETNOSIGPIPE}} and temporarily turns on {{F_SETNOSIGPIPE}}. This makes the
test reliable on OS X. I think that this is a better solution than depending on
signal delivery details that tend to be quite subtle.
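A minimal sketch of that idea, not the actual patch: the names {{ScopedNoSigpipe}} and {{nosigpipeAvailable}} are hypothetical, and the guard is a no-op on platforms that do not define {{F_SETNOSIGPIPE}}.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper, not libprocess code: reports whether this platform
// exposes the per-descriptor F_SETNOSIGPIPE fcntl command (OS X does).
inline bool nosigpipeAvailable()
{
#ifdef F_SETNOSIGPIPE
  return true;
#else
  return false;
#endif
}

// Hypothetical RAII guard: disables SIGPIPE generation on `fd` while in
// scope and restores the previous setting on destruction. It does nothing
// on platforms without F_SETNOSIGPIPE.
class ScopedNoSigpipe
{
public:
  explicit ScopedNoSigpipe(int fd) : fd_(fd), changed_(false)
  {
    assert(fd_ >= 0);
#ifdef F_SETNOSIGPIPE
    // F_GETNOSIGPIPE returns 0 while SIGPIPE generation is still enabled.
    if (::fcntl(fd_, F_GETNOSIGPIPE) == 0) {
      changed_ = (::fcntl(fd_, F_SETNOSIGPIPE, 1) != -1);
    }
#endif
  }

  ~ScopedNoSigpipe()
  {
#ifdef F_SETNOSIGPIPE
    // Only undo what this guard changed.
    if (changed_) {
      ::fcntl(fd_, F_SETNOSIGPIPE, 0);
    }
#endif
  }

  ScopedNoSigpipe(const ScopedNoSigpipe&) = delete;
  ScopedNoSigpipe& operator=(const ScopedNoSigpipe&) = delete;

  // True if this guard switched SIGPIPE generation off for the descriptor.
  bool active() const { return changed_; }

private:
  int fd_;
  bool changed_;
};
```

While the guard is active, a write to a pipe with no reader should fail with {{EPIPE}} instead of raising {{SIGPIPE}}.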
One thing I noticed is that once I disable {{SIGPIPE}} delivery on the file
descriptor, we block forever in {{sigwait(2)}} in the subsequent loop that
attempts to consume the {{SIGPIPE}}. This makes sense, since the error is then
reported by the system call ({{EPIPE}}) rather than by a signal, so there is
nothing for {{sigwait}} to consume. I don't know the history of that loop, but
I suspect it could be fixed by checking whether {{SIGPIPE}} is pending before
entering the {{sigwait}}, or by using {{sigtimedwait(2)}} on platforms that
support it. The latter won't fix OS X, though, since OS X does not support
that system call.
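The "check whether it is pending first" option could look roughly like the sketch below. {{consumePendingSigpipe}} is a hypothetical name, not existing libprocess code, and it assumes the caller has already blocked {{SIGPIPE}} on the current thread (for example with {{pthread_sigmask}}).

```cpp
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Hypothetical helper, not libprocess code. Assumes SIGPIPE is already
// blocked on the calling thread. Consumes a pending SIGPIPE if there is one
// and returns true; otherwise returns false without blocking.
inline bool consumePendingSigpipe()
{
  sigset_t pending;
  sigemptyset(&pending);

  // If no SIGPIPE is pending (e.g. the write only returned EPIPE), calling
  // sigwait() would block forever, so bail out early.
  if (sigpending(&pending) != 0 || sigismember(&pending, SIGPIPE) != 1) {
    return false;
  }

  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGPIPE);

  // A SIGPIPE is pending, so this sigwait() returns immediately.
  int signo = 0;
  int rc = sigwait(&set, &signo);
  assert(rc != 0 || signo == SIGPIPE);
  return rc == 0;
}
```

This only uses {{sigpending}} and {{sigwait}}, so it should not depend on {{sigtimedwait}} being available.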
[~bmahler] I can supply a patch if you can shepherd it.
> IO.Write test is flaky on OS X 10.10.
> -------------------------------------
>
> Key: MESOS-2079
> URL: https://issues.apache.org/jira/browse/MESOS-2079
> Project: Mesos
> Issue Type: Task
> Components: libprocess, technical debt, test
> Environment: OS X 10.10
> {noformat}
> $ clang++ --version
> Apple LLVM version 6.0 (clang-600.0.54) (based on LLVM 3.5svn)
> Target: x86_64-apple-darwin14.0.0
> Thread model: posix
> {noformat}
> Reporter: Benjamin Mahler
> Labels: flaky
>
> [~benjaminhindman]: If I recall correctly, this is related to MESOS-1658.
> Unfortunately, we don't have a stacktrace for SIGPIPE currently:
> {noformat}
> [ RUN ] IO.Write
> make[5]: *** [check-local] Broken pipe: 13
> {noformat}
> Running in gdb, it seems to always occur here:
> {code}
> Program received signal SIGPIPE, Broken pipe.
> [Switching to process 56827 thread 0x60b]
> 0x00007fff9a011132 in __psynch_cvwait ()
> (gdb) where
> #0 0x00007fff9a011132 in __psynch_cvwait ()
> #1 0x00007fff903e7ea0 in _pthread_cond_wait ()
> #2 0x000000010062f27c in Gate::arrive (this=0x101908a10, old=14780) at gate.hpp:82
> #3 0x0000000100600888 in process::schedule (arg=0x0) at src/process.cpp:1373
> #4 0x00007fff903e72fc in _pthread_body ()
> #5 0x00007fff903e7279 in _pthread_start ()
> #6 0x00007fff903e54b1 in thread_start ()
> {code}