Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Santhosh Kumar Shanmugham Wed, 18 Jul 2018 14:39:46 -0700


> On July 18, 2018, 2:21 p.m., Stephan Erb wrote:
> > src/main/python/apache/aurora/executor/aurora_executor.py
> > Lines 159 (patched)
> > <https://reviews.apache.org/r/67967/diff/1/?file=2061542#file2061542line159>
> >
> >     Should we use TASK_LOST here instead? Most users interpret TASK_FAILED 
> > as their responsibility whereas TASK_LOST is more of a misshap of 
> > Aurora/Mesos/Thermos. I would think an unknown exception in the runner is 
> > part of the latter category.
> 
> Santhosh Kumar Shanmugham wrote:
>     Hmm. Then we can argue that failure to create sandbox or fork the process 
> etc also should be treated as TASK_LOST? At Twitter this is really not going 
> to help us, since we have platform wrapper that cause TASK_FAILED and it is 
> already hard to differentiate user configuration failures against platform 
> dependency failures.
>     
>     I wanted to keep this consistent with the rest. If TASK_LOST makes more 
> sense for you I can update it.
> 
> Stephan Erb wrote:
>     You make a good point. Let's keep it at FAILED. If really necessary, I 
> could always come back later with a more complete proposal.


I think the differentiation of user vs platform failure needs a whole lot of 
clean up in the executor codebase. We have been putting this work off for 
sometime but we are starting to realize that we need this data to make better 
decisions.


- Santhosh Kumar


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67967/#review206209
-----------------------------------------------------------


On July 18, 2018, 1:27 p.m., Santhosh Kumar Shanmugham wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67967/
> -----------------------------------------------------------
> 
> (Updated July 18, 2018, 1:27 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin, Jordan Ly, Reza Motamedi, and 
> Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> If the ThermoTaskRunner encounters an Exception when trying to
> fork the process, it bubbles this up to the Executor which does
> not handle execptions other than TaskError. This leads to the
> executor leaving the task in STARTING state and we end up with
> tasks that get stranded in this state.
> 
> Fix it so that any unknown expection that is thrown when starting
> a runner leads to task failure and get marked as FAILED.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/aurora_executor.py 
> 8a9958fffc2312686dccc7daf6d216631d4c956e 
>   src/test/python/apache/aurora/executor/test_thermos_executor.py 
> f6ae1be5d56bfd845bd09db67ef0000a92091136 
> 
> 
> Diff: https://reviews.apache.org/r/67967/diff/1/
> 
> 
> Testing
> -------
> 
> ./gradlew test
> ./pants test src/test/python/apache::
> 
> 
> Thanks,
> 
> Santhosh Kumar Shanmugham
> 
>

Re: Review Request 67967: Unhandled exception should not strand runner in STARTING state.

Reply via email to