[ https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336423#comment-15336423 ]
Greg Mann commented on MESOS-5629:
----------------------------------

I just did some testing as well - reliably reproduced the segfault before the fix, and was unable to induce it after the fix. LGTM!

> Agent segfaults after request to '/files/browse'
> ------------------------------------------------
>
> Key: MESOS-5629
> URL: https://issues.apache.org/jira/browse/MESOS-5629
> Project: Mesos
> Issue Type: Bug
> Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
> Reporter: Greg Mann
> Assignee: Joerg Schad
> Priority: Blocker
> Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
> Attachments: test-browse.py
>
>
> We observed a number of agent segfaults today on an internal testing cluster.
> Here is a log excerpt:
> {code}
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 status_update_manager.cpp:392] Received status update acknowledgement (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 6d4248cd-2832-4152-b5d0-defbf36f6759-0000
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 (unix time) try "date -d @1466097149" if you are using GNU date ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 process::dispatch<>()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 mesos::internal::FilesProcess::authorize()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea mesos::internal::FilesProcess::browse()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 std::_Function_handler<>::_M_invoke()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultEEEEE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 process::ProcessManager::resume()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 start_thread
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main process exited, code=killed, status=11/SEGV
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service entered failed state.
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
> Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff time over, scheduling restart.
> {code}
> In every case, the stack trace indicates one of the {{/files/*}} endpoints; I observed this a number of times coming from {{browse()}}, and twice from {{read()}}.
> The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: [this|https://reviews.apache.org/r/48563/] and [this|https://reviews.apache.org/r/48566/], which were done to repair a different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587] on the master and agent.
> Thanks go to [~bmahler] for digging into this a bit and discovering a possible cause [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745], where use of {{defer()}} may be necessary to keep execution in the correct context.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
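
To illustrate the {{defer()}} hypothesis from the description, here is a minimal, self-contained sketch of the general idea. It does not use libprocess and is not the code from the linked {{slave.cpp}} lines. {{Actor}}, {{deferTo}}, and {{continuationRanInOwnerContext}} are hypothetical names invented for this example, and the "context" is simulated on a single thread. The assumption being modeled is that a callback invoked directly runs in whatever context happens to call it, while a deferred callback is only queued and later runs in the owning actor's context.

```cpp
#include <functional>
#include <queue>
#include <utility>

class Actor;

// Tracks which (hypothetical) actor is currently executing queued work.
static const Actor* currentActor = nullptr;

// Hypothetical stand-in for an actor: queued work only ever runs from
// drain(), i.e. in this actor's own context.
class Actor {
public:
  // Queue a piece of work to run later in this actor's context.
  void enqueue(std::function<void()> work) { mailbox_.push(std::move(work)); }

  // Run all queued work, marking this actor as the current context.
  void drain() {
    while (!mailbox_.empty()) {
      std::function<void()> work = std::move(mailbox_.front());
      mailbox_.pop();
      const Actor* previous = currentActor;
      currentActor = this;
      work();
      currentActor = previous;
    }
  }

  // True only while code is executing inside this actor's drain().
  bool inContext() const { return currentActor == this; }

private:
  std::queue<std::function<void()>> mailbox_;
};

// Hypothetical analogue of the defer() idea: invoking the returned wrapper
// only enqueues `f` on `actor` instead of running it in the caller's context.
std::function<void()> deferTo(Actor& actor, std::function<void()> f) {
  return [&actor, f]() { actor.enqueue(f); };
}

// Simulates a continuation fired from outside the owning actor and reports
// whether its body ended up running in the owner's context.
bool continuationRanInOwnerContext(bool useDefer) {
  Actor owner;
  bool ranInContext = false;
  std::function<void()> body = [&]() { ranInContext = owner.inContext(); };

  std::function<void()> continuation =
      useDefer ? deferTo(owner, body) : body;

  continuation();  // Fired outside owner.drain(): a foreign context.
  owner.drain();   // The owner later processes its mailbox.
  return ranInContext;
}
```

In this sketch, {{continuationRanInOwnerContext(false)}} returns false because the body runs immediately in the caller's context, while {{continuationRanInOwnerContext(true)}} returns true because the body is queued and only runs once the owner drains its mailbox. Whether this matches the actual cause of the segfault is exactly the open question raised in the description.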