[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589281#comment-16589281 ]

Benno Evers commented on MESOS-9177:
------------------------------------

As a preliminary update, I managed to narrow the segfault down to this lambda 
inside `FullFrameworkWriter`:

{code}
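      // completedTasks is a boost::circular_buffer<Owned<Task>>.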
      foreach (const Owned<Task>& task, framework_->completedTasks) {
        // Skip unauthorized tasks.
        if (!approvers_->approved<VIEW_TASK>(*task, framework_->info)) {
          continue;
        }

        writer->element(*task);
      }
{code}

Since the Mesos cluster where this segfault was observed runs with a 
non-standard (and quite low) value of --max_completed_tasks_per_framework=20, I 
tried to reproduce the crash locally: I started a mesos-master built from the 
same commit, used the `no-executor-framework` example to run many tasks, and 
repeatedly hit the `/state` endpoint on that master. While I was able to 
overload the JSON renderer of my web browser, I didn't manage to reproduce the 
crash.

Next, I turned to reverse-engineering the exact location of the crash. It 
seems to happen while incrementing a `boost::circular_buffer::iterator` 
(`boost::circular_buffer` being the container type of 
`Master::Framework::completedTasks`). This indicates that we're probably 
pushing values into the container while simultaneously iterating over it in 
another thread.
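
For illustration, here is a minimal single-threaded sketch (not Mesos code) of 
the invalidation rule documented for `boost::circular_buffer`: `push_back` into 
a full buffer overwrites the oldest element and invalidates iterators pointing 
to it, so a subsequent increment is undefined behavior of exactly the kind the 
trace suggests:

{code}
#include <boost/circular_buffer.hpp>

#include <iostream>

int main()
{
  // Capacity 3 stands in for a small --max_completed_tasks_per_framework.
  boost::circular_buffer<int> tasks(3);

  tasks.push_back(1);
  tasks.push_back(2);
  tasks.push_back(3); // The buffer is now full.

  boost::circular_buffer<int>::iterator it = tasks.begin();

  // Pushing into a full buffer overwrites the oldest element,
  // invalidating `it`.
  tasks.push_back(4);

  ++it;                           // Undefined behavior.
  std::cout << *it << std::endl;  // May crash or print garbage.

  return 0;
}
{code}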

However, I still don't have a theory for how this could happen, or for how to 
induce the crash locally, since all mutations seem to happen on the Master 
actor and thus should never run in parallel with the iteration.
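
For completeness, a contrived two-thread sketch of the suspected race (plain 
`std::thread` rather than libprocess actors; all names are made up): one thread 
keeps appending to the full buffer while another iterates it, much like the 
`foreach` in `FullFrameworkWriter`. Building with `-fsanitize=thread` flags the 
hazard immediately:

{code}
#include <boost/circular_buffer.hpp>

#include <atomic>
#include <thread>

int main()
{
  // Mirrors the low --max_completed_tasks_per_framework=20.
  boost::circular_buffer<int> tasks(20);
  for (int i = 0; i < 20; ++i) {
    tasks.push_back(i);
  }

  std::atomic<bool> done(false);

  // Stands in for the Master actor completing tasks: every push_back
  // into the full buffer invalidates iterators held elsewhere.
  std::thread writer([&tasks, &done]() {
    for (int i = 0; i < 10000000; ++i) {
      tasks.push_back(i);
    }
    done = true;
  });

  // Stands in for the /state renderer iterating completedTasks.
  long sum = 0;
  while (!done) {
    for (boost::circular_buffer<int>::iterator it = tasks.begin();
         it != tasks.end();
         ++it) {   // Racy increment, like the one in the stack trace.
      sum += *it;  // Racy read.
    }
  }

  writer.join();
  return static_cast<int>(sum % 2);
}
{code}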

> Mesos master segfaults when responding to /state requests.
> ----------------------------------------------------------
>
>                 Key: MESOS-9177
>                 URL: https://issues.apache.org/jira/browse/MESOS-9177
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Alexander Rukletsov
>            Assignee: Benno Evers
>            Priority: Blocker
>              Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @     0x7f367e7226d0 (unknown)
>  @     0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @     0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @     0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApproversEEEE_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @     0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0EEEEZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApproversEEEE_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @     0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @     0x7f368215f60e process::http::OK::OK()
>  @     0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApproversEEEE_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @     0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApproversEEEE_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @     0x7f36812215ac 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApproversEEEE_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEEEEEclEOS3_
>  @     0x7f36821f3541 process::ProcessBase::consume()
>  @     0x7f3682209fbc process::ProcessManager::resume()
>  @     0x7f368220fa76 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @     0x7f367eefc2b0 (unknown)
>  @     0x7f367e71ae25 start_thread
>  @     0x7f367e444bad __clone
> {noformat}


