[ https://issues.apache.org/jira/browse/MESOS-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Toenshoff updated MESOS-5748:
----------------------------------
    Comment: was deleted

(was: The problem does not seem fixed for me - or maybe it got reintroduced lately. I am hitting this on macOS after around 100 - 150 repetitions (did 3 runs).

{noformat}
$ ./3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLinkLeak" --gtest_repeat=-1 --gtest_break_on_failure
{noformat}

{noformat}
Repeating all tests (iteration 119) . . .

Note: Google Test filter = ProcessRemoteLinkTest.RemoteLinkLeak
[ RUN      ] ProcessRemoteLinkTest.RemoteLinkLeak
(libev) select: Invalid argument
*** Aborted at 1490865958 (unix time) try "date -d @1490865958" if you are using GNU date ***
PC: @     0x7fffb7621d42 __pthread_kill
*** SIGABRT (@0x7fffb7621d42) received by PID 59260 (TID 0x700009538000) stack trace: ***
    @     0x7fffb7702b3a _sigtramp
    @     0x7faf310fc080 (unknown)
    @     0x7fffb7587420 abort
    @        0x109a6b51d ev_syserr
    @        0x109a6be16 select_poll
    @        0x109a67635 ev_run
    @        0x109a21f2b ev_loop()
    @        0x109a21e96 process::EventLoop::run()
    @        0x1099448bf _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEEEEEPvS5_
    @     0x7fffb770c9af _pthread_body
    @     0x7fffb770c8fb _pthread_start
    @     0x7fffb770c101 thread_start
Abort trap: 6
{noformat}

As the stacktrace shows, I was testing this with a libev build.)

> Potential segfault in `link` and `send` when linking to a remote process
> ------------------------------------------------------------------------
>
>                 Key: MESOS-5748
>                 URL: https://issues.apache.org/jira/browse/MESOS-5748
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>    Affects Versions: 0.22.0, 0.23.0, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.0
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>              Labels: libprocess, mesosphere
>             Fix For: 0.27.4, 0.28.3, 1.0.0
>
>
> There is a race in the SocketManager, between a remote {{link}} and
> disconnection of the underlying socket.
> We potentially segfault here:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1512
> {{\*socket}} dereferences the shared pointer underpinning the {{Socket*}}
> object. However, the code above this line actually has ownership of the
> pointer:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1494-L1499
> If the socket dies during the link, the {{ignore_recv_data}} may delete the
> Socket underneath {{link}}:
> https://github.com/apache/mesos/blob/215e79f571a989e998488077d713c28c7528926e/3rdparty/libprocess/src/process.cpp#L1399-L1411
> ----
> The same race exists for {{send}}.
> This race was discovered while running a new test in repetition:
> https://reviews.apache.org/r/49175/
> On OSX, I hit the race consistently every 500-800 repetitions:
> {code}
> 3rdparty/libprocess/libprocess-tests --gtest_filter="ProcessRemoteLinkTest.RemoteLink" --gtest_break_on_failure --gtest_repeat=1000
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
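
For readers unfamiliar with the lifetime issue described in the quoted report, the following is a minimal sketch of the general pattern, not the libprocess code and not the actual patch. FakeSocket, FakeSocketManager, acquire and close are hypothetical names invented for illustration. The sketch assumes the usual remedy for this kind of race: the caller copies the shared pointer while holding the lock, so a concurrent disconnection can drop the manager's reference without destroying the object the caller is still using.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a socket object; not the libprocess class.
struct FakeSocket {
  int fd;
};

// Hypothetical, heavily simplified stand-in for a socket manager that
// owns its sockets through shared pointers.
class FakeSocketManager {
public:
  void add(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    sockets_[fd] = std::make_shared<FakeSocket>(FakeSocket{fd});
  }

  // Models the disconnection path: drops the manager's reference.
  void close(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    sockets_.erase(fd);
  }

  // Returns a COPY of the shared pointer, taken under the lock, so the
  // caller co-owns the socket for as long as it keeps the copy.
  std::shared_ptr<FakeSocket> acquire(int fd) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = sockets_.find(fd);
    if (it == sockets_.end()) {
      return nullptr;
    }
    return it->second;
  }

private:
  std::mutex mutex_;
  std::map<int, std::shared_ptr<FakeSocket>> sockets_;
};

// The caller holds its own copy across the disconnection, so the socket
// is still alive and safe to dereference afterwards.
bool survives_disconnect_with_copy() {
  FakeSocketManager manager;
  manager.add(7);
  std::shared_ptr<FakeSocket> held = manager.acquire(7);
  std::weak_ptr<FakeSocket> observer = held;
  manager.close(7);
  return !observer.expired() && held->fd == 7;
}

// Nobody but the manager owns the socket, so the disconnection destroys
// it. A raw pointer or reference kept by the caller would now dangle.
bool destroyed_without_copy() {
  FakeSocketManager manager;
  manager.add(7);
  std::weak_ptr<FakeSocket> observer = manager.acquire(7);
  manager.close(7);
  return observer.expired();
}
```

Both helper functions run on a single thread, so they show the ownership effect deterministically rather than reproducing the race itself. The weak pointer is used only to observe whether the object has been destroyed.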