Greg Mann created MESOS-5629:
--------------------------------
Summary: Agent segfaults after request to '/files/browse'
Key: MESOS-5629
URL: https://issues.apache.org/jira/browse/MESOS-5629
Project: Mesos
Issue Type: Bug
Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
Reporter: Greg Mann
Priority: Blocker
Fix For: 1.0.0
We observed a number of agent segfaults today on an internal testing cluster.
Here is a log excerpt:
{code}
Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830
status_update_manager.cpp:392] Received status update acknowledgement (UUID:
e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task
datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework
6d4248cd-2832-4152-b5d0-defbf36f6759-0000
Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830
status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING
(UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task
datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework
6d4248cd-2832-4152-b5d0-defbf36f6759-0000
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149
(unix time) try "date -d @1466097149" if you are using GNU date ***
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3
(unknown)
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received
by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 (unknown)
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 (unknown)
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33
process::dispatch<>()
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7
_ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752
mesos::internal::FilesProcess::authorize()
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea
mesos::internal::FilesProcess::browse()
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43
std::_Function_handler<>::_M_invoke()
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb
_ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341
process::ProcessManager::resume()
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 (unknown)
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5
start_thread
Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main
process exited, code=killed, status=11/SEGV
Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service entered
failed state.
Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff time
over, scheduling restart.
{code}
In every case, the stack trace points into one of the {{/files/*}} endpoints; I
observed this a number of times coming from {{browse()}}, and twice from
{{read()}}.
Thanks to [~bmahler] for digging into this and identifying a possible cause
[here|https://github.com/mesosphere/mesos-private/blob/greg/1.0-w-fixes/src/slave/slave.cpp#L5704-L5712],
where use of {{defer()}} may be necessary to keep execution in the correct
actor context.
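To illustrate the suspected hazard (this is a hand-rolled sketch, not Mesos or libprocess code): if a continuation attached to a future is run inline by whichever thread completes the future, it can touch the owning process's state off its serialized execution context; wrapping the continuation so it is re-enqueued onto the owner's queue, the way {{defer()}} does in libprocess, avoids that. The {{Actor}}, {{post()}}, and {{defer()}} names below are invented for the example:

{code}
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// Minimal single-threaded "actor": all posted work runs on one worker
// thread, so the actor's state needs no further synchronization.
class Actor {
public:
  Actor() : worker([this] { run(); }) {}

  ~Actor() {
    post(nullptr);  // empty function acts as a stop sentinel
    worker.join();
  }

  // Enqueue work to run on the actor's own thread (cf. dispatch()).
  void post(std::function<void()> f) {
    std::lock_guard<std::mutex> lock(m);
    q.push(std::move(f));
    cv.notify_one();
  }

  // Wrap a callback so that invoking it merely re-enqueues the real
  // work onto this actor's thread (cf. libprocess defer()).
  std::function<void()> defer(std::function<void()> f) {
    return [this, f] { post(f); };
  }

  std::thread::id id() const { return worker.get_id(); }

private:
  void run() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        f = std::move(q.front());
        q.pop();
      }
      if (!f) return;  // stop sentinel
      f();
    }
  }

  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()>> q;
  std::thread worker;  // declared last: starts after the queue exists
};

int main() {
  Actor actor;

  std::promise<std::thread::id> inline_id, deferred_id;

  // An undeferred continuation runs on whichever thread invokes it...
  auto raw = [&] { inline_id.set_value(std::this_thread::get_id()); };

  // ...while a deferred one is bounced back to the actor's thread.
  auto deferred =
      actor.defer([&] { deferred_id.set_value(std::this_thread::get_id()); });

  raw();       // executes inline, on the caller's (wrong) context
  deferred();  // only enqueues; executes later on the actor's thread

  assert(inline_id.get_future().get() == std::this_thread::get_id());
  assert(deferred_id.get_future().get() == actor.id());

  std::cout << "ok" << std::endl;
  return 0;
}
{code}

In libprocess terms, the continuation in {{Framework::recoverExecutor()}} would need to be wrapped with {{defer(self(), ...)}} so it runs inside the slave process rather than on the HTTP handler's thread.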
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)