[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333767#comment-16333767
]
Johan Gustavsson commented on MAPREDUCE-7022:
---------------------------------------------
Thanks for the follow-up comments. I noticed that the QA bot ran against the
wrong patch when I submitted 007; I presume the wrong run happened because the
patch was submitted right at the end of the Jira maintenance window.
Thanks for confirming that the TestUberAM failure was unrelated. As for
TestTaskAttemptReporter, it ran successfully on my local machine, and after
changing the temp dir it now passes on the QA bot too, though I'm still not
sure why it failed to begin with.
{quote}The following code is not going to be OK since FAILED_TRANSITION is a
static object shared across all task attempts. Just because one task attempt
wants to fail fast and kill its task does not mean another task attempt should
do the same for another task. Task failure does not always equate to job
failure if the user configured the job as such.
{quote}
Good point, I hadn't thought about the consequences of it being static. I moved
it back to state kept in TaskAttemptImpl.
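In rough outline the change looks like the sketch below (simplified and not the
exact patch code: the transition interface is stubbed locally and the field and
method names are illustrative):
{code:java}
// Simplified sketch only: the fast-fail decision lives on the attempt
// instance, while the transition singleton stays stateless and shared.
public class TaskAttemptImpl {

  // Stubbed locally for the sketch; the real code uses the state machine's
  // SingleArcTransition interface.
  interface Transition<T, E> {
    void transition(T operand, E event);
  }

  // Shared across ALL task attempts -- must not carry per-attempt state.
  private static final FailedTransition FAILED_TRANSITION = new FailedTransition();

  // Per-attempt flag: only this attempt's scratch dir overrun is recorded here.
  private volatile boolean fastFail = false;

  void setFastFail(boolean fastFail) {
    this.fastFail = fastFail;
  }

  static class FailedTransition implements Transition<TaskAttemptImpl, Object> {
    @Override
    public void transition(TaskAttemptImpl taskAttempt, Object event) {
      // Read the flag from the attempt passed in, never from static state,
      // so one attempt's fast fail cannot leak into another task's attempt.
      if (taskAttempt.fastFail) {
        // escalate: kill this attempt's task and fast fail the job
      }
      // ... normal failure handling ...
    }
  }
}
{code}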
{quote}If the disk checker encounters an IOException, should that be fatal to
the task attempt? I'm thinking of exceptions that could bubble to the top of
the disk checker thread, which I assume the YarnUncaughtExceptionHandler will
field and tear down the attempt. Just wondering out loud if there are classes
of exceptions besides interrupted that we would want to explicitly log/suppress
here. Seems like minimally we should catch (and potentially rethrow?)
IOExceptions so we can log which disk hit the error.
If we cannot start the disk limit check thread, should that be fatal to the
task attempt? Right now it logs an error but proceeds anyway.
{quote}
I added a catch for all exceptions in the monitoring thread as well; it only
logs the exception and then lets the thread continue. I believe an issue with
starting or running the monitoring thread should not affect the job itself,
since the thread only runs in a supporting manner and has no effect on the
results (except when the job is killed early). But if you think I'm wrong,
please let me know.
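In outline the monitoring thread now behaves roughly like this (a simplified,
self-contained sketch; the class and method names are illustrative, not the
exact patch code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Simplified sketch of the supporting monitor thread: any exception from a
// single check is logged and the loop keeps going, so a problem in the
// monitor itself never tears down the task attempt.
public class ScratchDirSizeMonitor implements Runnable {

  private static final Logger LOG =
      LoggerFactory.getLogger(ScratchDirSizeMonitor.class);

  private final long checkIntervalMs;

  public ScratchDirSizeMonitor(long checkIntervalMs) {
    this.checkIntervalMs = checkIntervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        checkScratchDirSize();              // may throw IOException etc.
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // shutting down: stop quietly
        return;
      } catch (Exception e) {
        // Log the failure (including the cause) and carry on; the monitor is
        // only supporting and must not affect the job's result.
        LOG.error("Exception while checking local scratch dir size, "
            + "continuing monitoring", e);
      }
      try {
        Thread.sleep(checkIntervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Placeholder for the sketch: compute local scratch dir usage and signal
  // fast fail on the task attempt if it exceeds the configured limit.
  private void checkScratchDirSize() throws Exception {
  }
}
{code}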
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Priority: Major
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch,
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch,
> MAPREDUCE-7022.006.patch, MAPREDUCE-7022.007.patch, MAPREDUCE-7022.008.patch,
> MAPREDUCE-7022.009.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly run
> Hive based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300Gb and others where it was at 10Tb, both producing around 200Gb on
> local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the tasks should not be retried either; instead the job should fast fail.