[ 
https://issues.apache.org/jira/browse/MESOS-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832489#comment-15832489
 ] 

Greg Mann commented on MESOS-6937:
----------------------------------

After debugging this with [~mcypark], we believe we have tracked down the race 
behind this failure. Although a comment in {{process::internal::Loop}} states 
that all accesses to the {{discard}} member are meant to be synchronized (see 
[this 
comment|https://github.com/apache/mesos/blob/c159516dbea0fde9c0335844dd4bac685f9dad0e/3rdparty/libprocess/include/process/loop.hpp#L286-L291]),
 one assignment to the member remains 
[unsynchronized|https://github.com/apache/mesos/blob/c159516dbea0fde9c0335844dd4bac685f9dad0e/3rdparty/libprocess/include/process/loop.hpp#L328].
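
To illustrate the kind of fix involved, here is a minimal sketch and not the 
actual libprocess code. It assumes the member is a {{std::function<void()>}} 
guarded by a {{std::mutex}}; the names {{LoopSketch}}, {{setDiscard}}, 
{{invokeDiscard}} and {{stress}} are hypothetical. The point is that every 
assignment to the shared callback takes the same lock as every read, since 
concurrent unguarded writes to such a member are a data race and could 
plausibly surface as a double free.

```cpp
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the loop state; NOT the libprocess class.
class LoopSketch
{
public:
  // Every write to `discard` takes the mutex, including any write that
  // might otherwise have been left unguarded.
  void setDiscard(std::function<void()> f)
  {
    std::lock_guard<std::mutex> lock(mutex);
    discard = std::move(f);
  }

  // Copy the callback under the lock, then invoke the copy outside it,
  // so the callback never runs while the mutex is held.
  void invokeDiscard()
  {
    std::function<void()> f;
    {
      std::lock_guard<std::mutex> lock(mutex);
      f = discard;
    }
    if (f) {
      f();
    }
  }

private:
  std::mutex mutex;
  std::function<void()> discard = []() {};
};

// Assigns and invokes the callback from several threads at once and
// returns how many times a callback actually ran.
int stress(int threads, int iterations)
{
  LoopSketch loop;
  std::atomic<int> calls{0};
  std::vector<std::thread> workers;

  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&loop, &calls, iterations]() {
      for (int i = 0; i < iterations; ++i) {
        loop.setDiscard([&calls]() { ++calls; });
        loop.invokeDiscard();
      }
    });
  }

  for (std::thread& worker : workers) {
    worker.join();
  }

  return calls.load();
}
```

Calling {{stress(4, 1000)}} exercises the guarded paths from four threads. 
Dropping the lock in {{setDiscard}} would reintroduce the race this sketch is 
meant to illustrate, although such a race may not fail on every run.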

Unfortunately, we have not been able to reproduce this bug locally so far. 
However, MPark has been seeing it occur frequently during his testing of the 
ASF Mesos CI. To test the fix, he ran multiple builds using a branch containing 
the fix and verified that the "double free or corruption" error did not appear. 
While this is not the firm verification we would like, it's the best I've been 
able to produce thus far, and it seems sufficient to justify merging the fix 
to resolve this issue.

cc [~anandmazumdar]

> ContentType/MasterAPITest.ReserveResources/1 fails during Writer close
> ----------------------------------------------------------------------
>
>                 Key: MESOS-6937
>                 URL: https://issues.apache.org/jira/browse/MESOS-6937
>             Project: Mesos
>          Issue Type: Bug
>          Components: tests
>         Environment: ASF CI, Ubuntu 14.04, libevent and SSL enabled
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Blocker
>              Labels: tests
>         Attachments: MasterAPITest.ReserveResources.txt
>
>
> This was observed on ASF CI. Libevent was enabled, but the test in question 
> was not running in SSL-enabled mode. We see the following stack trace:
> {code}
> *** Error in `src/mesos-tests': double free or corruption (fasttop): 
> 0x00002b4f7001bf70 ***
> *** Aborted at 1484691168 (unix time) try "date -d @1484691168" if you are 
> using GNU date ***
> PC: @     0x2b4f2bc9ac37 (unknown)
> *** SIGABRT (@0x3e8000069c7) received by PID 27079 (TID 0x2b4f35be5700) from 
> PID 27079; stack trace: ***
>     @     0x2b4f2b236330 (unknown)
>     @     0x2b4f2bc9ac37 (unknown)
>     @     0x2b4f2bc9e028 (unknown)
>     @     0x2b4f2bcd72a4 (unknown)
>     @     0x2b4f2bce355e (unknown)
>     @     0x2b4f299e98a0 
> _ZNSt14_Function_base13_Base_managerIZN7process8internal4LoopIZNS1_4http4Pipe6Reader7readAllEvEUlvE_ZNS6_7readAllEvEUlRKSsE0_SsSsE3runENS1_6FutureISsEEEUlvE3_E10_M_managerERSt9_Any_dataRKSG_St18_Manager_operation
>     @     0x2b4f299fadb9 
> _ZN7process8internal4LoopIZNS_4http4Pipe6Reader7readAllEvEUlvE_ZNS4_7readAllEvEUlRKSsE0_SsSsE3runENS_6FutureISsEE
>     @     0x2b4f299fca57 
> _ZNSt17_Function_handlerIFvRKN7process6FutureISsEEEZNKS2_5onAnyIRZNS0_8internal4LoopIZNS0_4http4Pipe6Reader7readAllEvEUlvE_ZNSB_7readAllEvEUlRKSsE0_SsSsE3runES2_EUlS4_E2_vEES4_OT_NS2_6PreferEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
>     @     0x2b4f28a4cc16 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureISsEEEEJRS4_EEEvRKSt6vectorIT_SaISB_EEDpOT0_
>     @     0x2b4f29a2479f process::Future<>::_set<>()
>     @     0x2b4f299f46a9 process::http::Pipe::Writer::close()
>     @     0x2b4f29a24d32 
> process::StreamingRequestDecoder::on_message_complete()
>     @     0x2b4f29b0641d http_parser_execute
>     @     0x2b4f29aaeafe process::internal::decode_recv()
>     @     0x2b4f29abc44b 
> _ZNSt17_Function_handlerIFvRKN7process6FutureImEEEZNKS2_5onAnyISt5_BindIFPFvS4_PcmNS0_7network8internal6SocketINS9_4inet7AddressEEEPNS0_23StreamingRequestDecoderEESt12_PlaceholderILi1EES8_mSE_SG_EEvEES4_OT_NS2_6PreferEEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
>     @          0x14e136e process::internal::run<>()
>     @          0x14e5d9f process::Future<>::_set<>()
>     @     0x2b4f29a4c23d 
> _ZN7process8internal4LoopIZNS_2io8internal4readEiPvmEUlvE_ZNS3_4readEiS4_mEUlRK6OptionImEE0_S7_mE3runENS_6FutureIS7_EE
>     @     0x2b4f29a4dc6f 
> _ZNSt17_Function_handlerIFvRKN7process6FutureINS0_11ControlFlowImEEEEEZNKS4_5onAnyIRZNS0_8internal4LoopIZNS0_2io8internal4readEiPvmEUlvE_ZNSC_4readEiSD_mEUlRK6OptionImEE0_SG_mE3runENS1_ISG_EEEUlS6_E0_vEES6_OT_NS4_6PreferEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
>     @     0x2b4f29a5bec6 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureINS_11ControlFlowImEEEEEEJRS6_EEEvRKSt6vectorIT_SaISD_EEDpOT0_
>     @     0x2b4f29a5d971 process::Future<>::_set<>()
>     @     0x2b4f29a600a1 process::Promise<>::associate()
>     @     0x2b4f29a608da process::internal::thenf<>()
>     @     0x2b4f29b0170e 
> _ZN7process8internal3runISt8functionIFvRKNS_6FutureIsEEEEJRS4_EEEvRKSt6vectorIT_SaISB_EEDpOT0_
>     @     0x2b4f29b01cd1 process::Future<>::_set<>()
>     @     0x2b4f29b00b36 process::io::internal::pollCallback()
>     @     0x2b4f29b0b990 event_process_active_single_queue
>     @     0x2b4f29b0bf06 event_process_active
>     @     0x2b4f29b0c662 event_base_loop
>     @     0x2b4f29aff96d process::EventLoop::run()
>     @     0x2b4f2b4f5a60 (unknown)
>     @     0x2b4f2b22e184 start_thread
> {code}
> Find the log from the failed run attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
