[
https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335383#comment-15335383
]
Greg Mann edited comment on MESOS-5629 at 6/17/16 4:37 AM:
-----------------------------------------------------------
I was able to reproduce this on my local machine; the segfault occurs when a
request for a particular path arrives at {{/files/browse}} *while* that path
is being garbage-collected by the agent. This leads to {{FilesProcess::authorize()}}
attempting to invoke an authorization callback that has just been removed from
its {{authorizations}} map. We could avoid such races by changing
[this|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745]
and
[this|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5939-L5946]
into deferred dispatches to the {{FilesProcess}} (i.e., {{authorize =
defer(files, []() ... )}}).
To reproduce, I:
1) Brought the hard disk usage to over 90%, to force GC
2) Launched a master and agent
3) Ran the test-framework once
4) Hit 'agent.host/files/debug' to get a valid path
5) Ran the attached script, 'test-browse.py', to rapidly send requests to
'agent.host/files/browse' for the valid path
6) While 'test-browse.py' was running, repeatedly ran the test-framework until
the sandbox from its first run was GC'd; at that point, the segfault was
likely to occur.
Here's the stack trace I generated:
{code}
E0616 20:58:36.976708 183517184 process.cpp:2040] Failed to shutdown socket
with fd 18: Socket is not connected
I0616 20:58:36.976701 179761152 slave.cpp:3783] executor(1)@127.0.0.1:51568
exited
E0616 20:58:36.976786 183517184 process.cpp:2040] Failed to shutdown socket
with fd 10: Socket is not connected
E0616 20:58:36.982058 183517184 process.cpp:2040] Failed to shutdown socket
with fd 13: Socket is not connected
E0616 20:58:36.985937 183517184 process.cpp:2040] Failed to shutdown socket
with fd 11: Socket is not connected
*** Aborted at 1466135916 (unix time) try "date -d @1466135916" if you are
using GNU date ***
PC: @ 0x102658c48 process::PID<>::PID()
*** SIGSEGV (@0x0) received by PID 21788 (TID 0x10adfe000) stack trace: ***
@ 0x7fff90b6df1a _sigtramp
@ 0x10203c8ad std::__1::char_traits<>::compare()
@ 0x102658bcd process::PID<>::PID()
@ 0x1025de5eb process::Process<>::self()
@ 0x1033b6ecd process::dispatch<>()
@ 0x10323f7f0
mesos::internal::slave::Framework::launchExecutor()::$_12::operator()()
@ 0x10323f714
_ZNSt3__128__invoke_void_return_wrapperIN7process6FutureIbEEE6__callIJRZN5mesos8internal5slave9Framework14launchExecutorERKNS6_12ExecutorInfoERKNS6_8TaskInfoEE4$_12RK6OptionINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEEEEEES3_DpOT_
@ 0x10323f4a3 std::__1::__function::__func<>::operator()()
@ 0x1024dc667 std::__1::function<>::operator()()
@ 0x1024c365a mesos::internal::FilesProcess::authorize()
@ 0x1024bfe70 mesos::internal::FilesProcess::browse()
@ 0x1024eb297
_ZNSt3__128__invoke_void_return_wrapperIN7process6FutureINS1_4http8ResponseEEEE6__callIJRNS_6__bindIRMN5mesos8internal12FilesProcessEFS5_RKNS3_7RequestERK6OptionINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEEEEJPSB_RNS_12placeholders4__phILi1EEERNSU_ILi2EEEEEESE_SO_EEES5_DpOT_
@ 0x1024eae13
_ZNSt3__110__function6__funcINS_6__bindIRMN5mesos8internal12FilesProcessEFN7process6FutureINS6_4http8ResponseEEERKNS8_7RequestERK6OptionINS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEEEEJPS5_RNS_12placeholders4__phILi1EEERNST_ILi2EEEEEENSI_ISY_EEFSA_SD_SN_EEclESD_SN_
@ 0x104cc4342 std::__1::function<>::operator()()
@ 0x104beb674
_ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENK3$_3clERKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEENKUlRKNS5_IbEEE_clESG_
@ 0x104beeae1
_ZZZNK7process9_DeferredIZZNS_11ProcessBase5visitERKNS_9HttpEventEENK3$_3clERKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEEUlRKNS6_IbEEE_EcvNSt3__18functionIFvT_EEEISH_EEvENKUlSH_E_clESH_ENKUlvE_clEv
@ 0x104beeaad
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZZNK7process9_DeferredIZZNS3_11ProcessBase5visitERKNS3_9HttpEventEENK3$_3clERKNS3_6FutureI6OptionINS3_4http14authentication20AuthenticationResultEEEEEUlRKNSA_IbEEE_EcvNS_8functionIFvT_EEEISL_EEvENKUlSL_E_clESL_EUlvE_EEEvDpOT_
@ 0x104bee7ec
_ZNSt3__110__function6__funcIZZNK7process9_DeferredIZZNS2_11ProcessBase5visitERKNS2_9HttpEventEENK3$_3clERKNS2_6FutureI6OptionINS2_4http14authentication20AuthenticationResultEEEEEUlRKNS9_IbEEE_EcvNS_8functionIFvT_EEEISK_EEvENKUlSK_E_clESK_EUlvE_NS_9allocatorIST_EEFvvEEclEv
@ 0x10208a7d1 std::__1::function<>::operator()()
@ 0x1020ea359
_ZZN7process8dispatchERKNS_4UPIDERKNSt3__18functionIFvvEEEENKUlPNS_11ProcessBaseEE_clESA_
@ 0x1020ea330
_ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRZN7process8dispatchERKNS3_4UPIDERKNS_8functionIFvvEEEEUlPNS3_11ProcessBaseEE_SD_EEEvDpOT_
@ 0x1020ea0cc
_ZNSt3__110__function6__funcIZN7process8dispatchERKNS2_4UPIDERKNS_8functionIFvvEEEEUlPNS2_11ProcessBaseEE_NS_9allocatorISD_EEFvSC_EEclEOSC_
@ 0x104c0837f std::__1::function<>::operator()()
@ 0x104bd9bbf process::ProcessBase::visit()
@ 0x104c3b1ae process::DispatchEvent::visit()
@ 0x10206de51 process::ProcessBase::serve()
@ 0x104bd68b1 process::ProcessManager::resume()
@ 0x104be2086
process::ProcessManager::init_threads()::$_1::operator()()
@ 0x104be1d29
_ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_1EEEEEPvS6_
@ 0x7fff8e76605a _pthread_body
@ 0x7fff8e765fd7 _pthread_start
@ 0x7fff8e7633ed thread_start
{code}
> Agent segfaults after request to '/files/browse'
> ------------------------------------------------
>
> Key: MESOS-5629
> URL: https://issues.apache.org/jira/browse/MESOS-5629
> Project: Mesos
> Issue Type: Bug
> Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
> Reporter: Greg Mann
> Assignee: Greg Mann
> Priority: Blocker
> Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
>
> We observed a number of agent segfaults today on an internal testing cluster.
> Here is a log excerpt:
> {code}
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830
> status_update_manager.cpp:392] Received status update acknowledgement (UUID:
> e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework
> 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830
> status_update_manager.cpp:824] Checkpointing ACK for status update
> TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework
> 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149
> (unix time) try "date -d @1466097149" if you are using GNU date ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received
> by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33
> process::dispatch<>()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7
> _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752
> mesos::internal::FilesProcess::authorize()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea
> mesos::internal::FilesProcess::browse()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43
> std::_Function_handler<>::_M_invoke()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb
> _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341
> process::ProcessManager::resume()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5
> start_thread
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main
> process exited, code=killed, status=11/SEGV
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service
> entered failed state.
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
> Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff
> time over, scheduling restart.
> {code}
> In every case, the stack trace indicates one of the {{/files/*}} endpoints; I
> observed this a number of times coming from {{browse()}}, and twice from
> {{read()}}.
> The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied:
> [this|https://reviews.apache.org/r/48563/] and
> [this|https://reviews.apache.org/r/48566/], which were done to repair a
> different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587]
> on the master.
> Thanks go to [~bmahler] for digging into this a bit and discovering a
> possible cause
> [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745],
> where use of {{defer()}} may be necessary to keep execution in the correct
> context.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)