[ 
https://issues.apache.org/jira/browse/MESOS-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676282#comment-13676282
 ] 

Benjamin Hindman commented on MESOS-479:
----------------------------------------

I've included the test output in the description. The bug looks like it's here:

libprocess: process-isolator(19)@10.35.255.108:49643 terminating due to 
basic_filebuf::underflow error reading the file 

Here's my hunch, this was running on a Linux box and the ProcessIsolator might 
do a proc::status on a pid which might have already terminated in which case 
doing the following could cause a C++ exception to get thrown:

file >> _ >> comm >> state >> ppid >> pgrp >> session >> tty_nr
       >> tpgid >> flags >> minflt >> cminflt >> majflt >> cmajflt
       >> utime >> stime >> cutime >> cstime >> priority >> nice
       >> num_threads >> itrealvalue >> starttime >> vsize >> rss
       >> rsslim >> startcode >> endcode >> startstack >> kstkeip
       >> signal >> blocked >> sigcatch >> wchan >> nswap >> cnswap

We should wrap all of this in a try/catch block or change these to individual 
reads.

It's possible another place a C++ exception could get thrown is within killtree 
since it does an os::shell and passes LOG(INFO) ...
                
> SlaveRecoveryTest/0.CleanupExecutor failure.
> --------------------------------------------
>
>                 Key: MESOS-479
>                 URL: https://issues.apache.org/jira/browse/MESOS-479
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Hindman
>
> [ RUN      ] SlaveRecoveryTest/0.CleanupExecutor
> Checkpointing SlaveInfo to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/slave.info'
> Checkpointing FrameworkInfo to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.info'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> Checkpointing ExecutorInfo to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/executor.info'
> Checkpointing Task to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/tasks/07360cb3-7b42-44b5-9942-b27802a18224/task.info'
> Checkpointing forked pid 38518
> Checkpointing '38518' to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/pids/forked.pid'
> Fetching resources into 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9'
> Checkpointing 'executor(1)@10.35.255.108:48801' to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/executors/07360cb3-7b42-44b5-9942-b27802a18224/runs/60657969-3cdc-46e3-ba9e-51c8db502ef9/pids/libprocess.pid'
> Registered executor on smfd-atr-11-sr1.devel.twitter.com
> Starting task 07360cb3-7b42-44b5-9942-b27802a18224
> Forked command at 38572
> sh -c 'sleep 1000'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> Checkpointing 'scheduler(84)@10.35.255.108:49643' to 
> '/tmp/SlaveRecoveryTest_0_CleanupExecutor_WDbjOB/meta/slaves/201305220702-1828659978-49643-36613-0/frameworks/201305220702-1828659978-49643-36613-0000/framework.pid'
> libprocess: process-isolator(19)@10.35.255.108:49643 terminating due to 
> basic_filebuf::underflow error reading the file
> Waited on process 38572, returned status 15
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0522 07:02:54.414633 38554 process_utils.hpp:64] Stopping ... 38572
> Group members:
>   PID  PPID  PGID  SESS COMMAND
> 38572 38518 38572 38572 sleep 1000
> Session members:
>   PID  PPID  PGID  SESS COMMAND
> 38572 38518 38572 38572 sleep 1000
> Sent signal to 38572
> GMOCK WARNING:
> Uninteresting mock function call - returning directly.
>     Function call: slaveLost(0x7fff81049190, @0x7fc814001eb0 
> 201305220702-1828659978-49643-36613-0)
> Stack trace:
> ../../src/tests/slave_recovery_tests.cpp:764: Failure
> Value of: status.get().state()
>   Actual: TASK_LOST
> Expected: TASK_FAILED

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to