[
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589560#comment-16589560
]
Benno Evers commented on MESOS-9177:
------------------------------------
I think I identified the issue: The recently committed patch 2f4d9ae0 ("Batch
'/state' requests on master") introduced a new code path that can lead to
multiple threads iterating over the same `completedTasks` circular_buffer in
parallel.
In theory this is fine, since iteration is read-only and the documentation for
boost::circular_buffer explicitly states that parallel reads are thread-safe as
long as no data is modified.
However, the boost version that is used on the cluster where this segfault was
observed is quite old (1.53), and in that version boost defaults to using
checked debug iterators for iteration. These have a *mutable* pointer member
m_next forming a mutable chain of iterators that is updated without
synchronization whenever a new iterator is created or destroyed, making it
unsafe to iterate concurrently even over const versions of the same circular
buffer.
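To illustrate why such "read-only" iteration is not actually read-only, here is
a minimal, single-threaded sketch of the bookkeeping pattern described above.
It is NOT the Boost implementation: the names IteratorRegistry, IteratorBase,
head_, next_ and demoLiveIteratorCount are hypothetical and chosen only for
illustration. It shows that merely creating an iterator over a const container
writes to hidden mutable state shared by all iterators of that container.

```cpp
#include <cassert>
#include <cstddef>

class IteratorBase;

// Hypothetical stand-in for the container side of a checked-iterator scheme:
// it owns the head of an intrusive singly linked list of live iterators.
class IteratorRegistry {
public:
  // Both are const, yet both write to the mutable members below.
  void registerIterator(const IteratorBase* it) const;
  void unregisterIterator(const IteratorBase* it) const;
  std::size_t liveIterators() const;

private:
  mutable const IteratorBase* head_ = nullptr;
};

// Hypothetical stand-in for a checked iterator: it links itself into the
// container's list on construction and unlinks itself on destruction.
class IteratorBase {
public:
  explicit IteratorBase(const IteratorRegistry* registry)
    : registry_(registry) {
    registry_->registerIterator(this);
  }
  IteratorBase(const IteratorBase&) = delete;
  IteratorBase& operator=(const IteratorBase&) = delete;
  ~IteratorBase() { registry_->unregisterIterator(this); }

private:
  friend class IteratorRegistry;
  const IteratorRegistry* registry_;
  mutable const IteratorBase* next_ = nullptr;
};

inline void IteratorRegistry::registerIterator(const IteratorBase* it) const {
  // Unsynchronized read-modify-write of shared state. If two threads run
  // this at the same time, the list can be corrupted.
  it->next_ = head_;
  head_ = it;
}

inline void IteratorRegistry::unregisterIterator(const IteratorBase* it) const {
  const IteratorBase* prev = nullptr;
  for (const IteratorBase* p = head_; p != nullptr; prev = p, p = p->next_) {
    if (p == it) {
      if (prev != nullptr) {
        prev->next_ = p->next_;
      } else {
        head_ = p->next_;
      }
      return;
    }
  }
}

inline std::size_t IteratorRegistry::liveIterators() const {
  std::size_t count = 0;
  for (const IteratorBase* p = head_; p != nullptr; p = p->next_) {
    ++count;
  }
  return count;
}

// Two "readers" of a const container still change its hidden state.
inline std::size_t demoLiveIteratorCount() {
  const IteratorRegistry buffer{};
  IteratorBase first(&buffer);
  IteratorBase second(&buffer);
  return buffer.liveIterators();
}
```

In this sketch the const container's internal list grows and shrinks as
iterators come and go. With several threads each creating their own iterator,
the unsynchronized updates to head_ and next_ would be a data race, which is
consistent with the invalid pointer dereference seen in the stack trace.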
This was fixed in Boost about 2.5 years ago by the following commit:
{code}
commit ea60799f315aa2e861d0e14ca9012950021c2fc6
Author: Andrey Semashev <[email protected]>
Date: Fri Apr 29 00:56:06 2016 +0300
Disable debug implementation by default
The debug implementation is not thread-safe, even if different threads are
using separate iterators for reading elements of the container.
BOOST_CB_DISABLE_DEBUG macro is no longer used, BOOST_CB_ENABLE_DEBUG=1 should
be defined instead to enable debug support.
Fixes https://svn.boost.org/trac/boost/ticket/6277.
{code}
which also explains why we could not see the issue locally, since the Boost
version bundled with Mesos already contains the fix.
The issue should disappear after either upgrading the Boost version or adding
the `BOOST_CB_DISABLE_DEBUG=1` macro to the build process.
> Mesos master segfaults when responding to /state requests.
> ----------------------------------------------------------
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.7.0
> Reporter: Alexander Rukletsov
> Assignee: Benno Evers
> Priority: Blocker
> Labels: mesosphere
>
> {noformat}
> *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8;
> stack trace: ***
> @ 0x7f367e7226d0 (unknown)
> @ 0x7f3681266913
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
> @ 0x7f3681266af0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f36812882d0
> mesos::internal::master::FullFrameworkWriter::operator()()
> @ 0x7f36812889d0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f368121aef0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f3681241be3
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApproversEEEE_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
> @ 0x7f3681242760
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
> @ 0x7f368215f60e process::http::OK::OK()
> @ 0x7f3681219061
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApproversEEEE_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
> @ 0x7f36812212c0
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApproversEEEE_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
> @ 0x7f36812215ac
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApproversEEEE_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEEEEEclEOS3_
> @ 0x7f36821f3541 process::ProcessBase::consume()
> @ 0x7f3682209fbc process::ProcessManager::resume()
> @ 0x7f368220fa76
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f367eefc2b0 (unknown)
> @ 0x7f367e71ae25 start_thread
> @ 0x7f367e444bad __clone
> {noformat}