[
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16365805#comment-16365805
]
James Peach commented on MESOS-8585:
------------------------------------
Yeah, crashing in this case seems pretty unfortunate.
`createExecutorDirectory` should probably return an error, and we should
refactor the callers so that they can propagate it correctly.
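
A rough sketch of the shape this could take, not the actual Mesos code: the
helper returns an error to its caller instead of aborting, and the caller
maps that error to a task status. `DirectoryResult`, `createDirectoryForUser`,
and `statusForLaunch` are hypothetical names made up for illustration (Mesos
itself would more likely use a stout type such as `Try<std::string>`), and
the user lookup is done with plain POSIX `getpwnam`. Looking up the user
before creating the directory is a design choice of this sketch, so that an
unknown user does not leave an empty directory behind.

```cpp
#include <cerrno>
#include <cstring>
#include <optional>
#include <string>

#include <pwd.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for a result type: holds either the created
// path or an error message.
struct DirectoryResult {
  std::optional<std::string> path;  // Set on success.
  std::string error;                // Set on failure.

  bool isError() const { return !path.has_value(); }
};

// Creates `directory` and, if `user` is given, chowns it to that user.
// Every failure is returned to the caller; nothing here aborts.
DirectoryResult createDirectoryForUser(
    const std::string& directory,
    const std::optional<std::string>& user)
{
  const passwd* entry = nullptr;

  if (user.has_value()) {
    // Resolve the user first so an unknown user fails early.
    entry = ::getpwnam(user->c_str());
    if (entry == nullptr) {
      return {std::nullopt,
              "Failed to chown directory to '" + *user +
                "': No such user '" + *user + "'"};
    }
  }

  // An already existing directory is tolerated in this sketch.
  if (::mkdir(directory.c_str(), 0755) != 0 && errno != EEXIST) {
    return {std::nullopt,
            "Failed to create directory '" + directory + "': " +
              std::strerror(errno)};
  }

  if (entry != nullptr &&
      ::chown(directory.c_str(), entry->pw_uid, entry->pw_gid) != 0) {
    return {std::nullopt,
            "Failed to chown directory '" + directory + "': " +
              std::strerror(errno)};
  }

  return {directory, ""};
}

// Hypothetical caller: turns the error into a task status string
// rather than letting the process crash.
std::string statusForLaunch(const DirectoryResult& result)
{
  return result.isError() ? "TASK_FAILED" : "TASK_RUNNING";
}
```

With something like this, a launch for user 'bad' would surface as an error
value that the caller can report as a failed task, while the agent process
keeps running.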
> Agent Crashes When Asked to Start Task with Unknown User
> ------------------------------------------------------
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
> Issue Type: Bug
> Components: agent
> Affects Versions: 1.5.0
> Reporter: Karsten
> Priority: Major
> Attachments: dcos-mesos-slave.service.1.gz,
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an
> unknown user. The test expects a {{TASK_FAILED}}. However, we see
> {{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent
> crashes and restarts.
>
> {code}
> 783 2018-02-14 14:55:45: I0214 14:55:45.319974 6213 slave.cpp:2542]
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605 6213 paths.cpp:727]
> Creating sandbox
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
> for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131 6213 paths.cpp:735]
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad'
> Failed to create executor directory
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4listIbSaIbEEEERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS17_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEEEEEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73 (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited,
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel
> agent...
> {code}