[
https://issues.apache.org/jira/browse/MESOS-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676282#comment-13676282
]
Benjamin Hindman commented on MESOS-479:
----------------------------------------
I've included the test output in the description. The bug looks like it's here:
libprocess: process-isolator(19)@10.35.255.108:49643 terminating due to
basic_filebuf::underflow error reading the file
Here's my hunch: this was running on a Linux box, and the ProcessIsolator might
call proc::status on a pid that has already terminated, in which case the
following could throw a C++ exception:
    file >> _ >> comm >> state >> ppid >> pgrp >> session >> tty_nr
         >> tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt
         >> utime >> stime >> cutime >> cstime >> priority >> nice
         >> num_threads >> itrealvalue >> starttime >> vsize >> rss
         >> rsslim >> startcode >> endcode >> startstack >> kstkeip
         >> signal >> blocked >> sigcatch >> wchan >> nswap >> cnswap
We should wrap all of this in a try/catch block or change these to individual
reads.
Another place a C++ exception could possibly be thrown is within killtree,
since it does an os::shell and passes LOG(INFO) ...
> SlaveRecoveryTest/0.CleanupExecutor failure.
> --------------------------------------------
>
> Key: MESOS-479
> URL: https://issues.apache.org/jira/browse/MESOS-479
> Project: Mesos
> Issue Type: Bug
> Reporter: Benjamin Hindman
>
> [ RUN ] SlaveRecoveryTest/0.CleanupExecutor
> Checkpointing SlaveInfo to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/slave.info'
> Checkpointing FrameworkInfo to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.info'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> Checkpointing ExecutorInfo to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/executor.info'
> Checkpointing Task to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/tasks/07360cb3-7b42-44b5-9942-b27802a18224/task.info'
> Checkpointing forked pid 38518
> Checkpointing '38518' to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/pids/forked.pid'
> Fetching resources into
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9'
> Checkpointing 'executor(1)@10.35.255.108:48801' to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/pids/libprocess.pid'
> Registered executor on smfd-atr-11-sr1.devel.twitter.com
> Starting task 07360cb3-7b42-44b5-9942-b27802a18224
> Forked command at 38572
> sh -c 'sleep 1000'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> libprocess: process-isolator(19)@10.35.255.108:49643 terminating due to
> basic_filebuf::underflow error reading the file
> Waited on process 38572, returned status 15
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0522 07:02:54.414633 38554 process_utils.hpp:64] Stopping ... 38572
> Group members:
> PID PPID PGID SESS COMMAND
> 38572 38518 38572 38572 sleep 1000
> Session members:
> PID PPID PGID SESS COMMAND
> 38572 38518 38572 38572 sleep 1000
> Sent signal to 38572
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
> Function call: slaveLost(0x7fff81049190, @0x7fc814001eb0
> 201305220702-1828659978-49643-36613-0)
> Stack trace:
> ../../src/tests/slave_recovery_tests.cpp:764: Failure
> Value of: status.get().state()
> Actual: TASK_LOST
> Expected: TASK_FAILED