[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333767#comment-16333767
]
Johan Gustavsson commented on MAPREDUCE-7022:
---------------------------------------------
Thanks for the follow-up comments. I noticed that the QA bot ran against the
wrong patch when I submitted 007; I presume the wrong run happened because the
patch was submitted right at the end of the Jira maintenance window.
Thanks for confirming that the TestUberAM failure was unrelated. As for
TestTaskAttemptReporter, it ran successfully on my local machine, and after
changing the temp dir it now passes on the QA bot too, though I'm still not
sure why it failed to begin with.
{quote}The following code is not going to be OK since FAILED_TRANSITION is a
static object shared across all task attempts. Just because one task attempt
wants to fail fast and kill its task does not mean another task attempt should
do the same for another task. Task failure does not always equate to job
failure if the user configured the job as such.
{quote}
Good point, I hadn't thought about the consequences of it being static. I moved
it back to state kept in TaskAttemptImpl.
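In rough outline the change looks like the sketch below (simplified and not the
exact patch code: the transition interface is stubbed locally and the field and
method names are illustrative):
{code:java}
// Simplified sketch only: the fast-fail decision lives on the attempt
// instance, while the transition singleton stays stateless and shared.
public class TaskAttemptImpl {

  // Stubbed locally for the sketch; the real code uses the state machine's
  // SingleArcTransition interface.
  interface Transition<T, E> {
    void transition(T operand, E event);
  }

  // Shared across ALL task attempts -- must not carry per-attempt state.
  private static final FailedTransition FAILED_TRANSITION = new FailedTransition();

  // Per-attempt flag: only this attempt's scratch dir overrun is recorded here.
  private volatile boolean fastFail = false;

  void setFastFail(boolean fastFail) {
    this.fastFail = fastFail;
  }

  static class FailedTransition implements Transition<TaskAttemptImpl, Object> {
    @Override
    public void transition(TaskAttemptImpl taskAttempt, Object event) {
      // Read the flag from the attempt passed in, never from static state,
      // so one attempt's fast fail cannot leak into another task's attempt.
      if (taskAttempt.fastFail) {
        // escalate: kill this attempt's task and fast fail the job
      }
      // ... normal failure handling ...
    }
  }
}
{code}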
{quote}If the disk checker encounters an IOException, should that be fatal to
the task attempt? I'm thinking of exceptions that could bubble to the top of
the disk checker thread, which I assume the YarnUncaughtExceptionHandler will
field and tear down the attempt. Just wondering out loud if there are classes
of exceptions besides interrupted that we would want to explicitly log/suppress
here. Seems like minimally we should catch (and potentially rethrow?)
IOExceptions so we can log which disk hit the error.
If we cannot start the disk limit check thread, should that be fatal to the
task attempt? Right now it logs an error but proceeds anyway.
{quote}
I added a catch for all exceptions in the monitoring thread as well; it only
logs the exception and then lets the thread continue. I believe an issue with
starting or running the monitoring thread should not affect the job itself,
since the thread only runs in a supporting manner and has no effect on the
results (except when the job is killed early). But if you think I'm wrong,
please let me know.
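In outline the monitoring thread now behaves roughly like this (a simplified,
self-contained sketch; the class and method names are illustrative, not the
exact patch code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Simplified sketch of the supporting monitor thread: any exception from a
// single check is logged and the loop keeps going, so a problem in the
// monitor itself never tears down the task attempt.
public class ScratchDirSizeMonitor implements Runnable {

  private static final Logger LOG =
      LoggerFactory.getLogger(ScratchDirSizeMonitor.class);

  private final long checkIntervalMs;

  public ScratchDirSizeMonitor(long checkIntervalMs) {
    this.checkIntervalMs = checkIntervalMs;
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        checkScratchDirSize();              // may throw IOException etc.
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt(); // shutting down: stop quietly
        return;
      } catch (Exception e) {
        // Log the failure (including the cause) and carry on; the monitor is
        // only supporting and must not affect the job's result.
        LOG.error("Exception while checking local scratch dir size, "
            + "continuing monitoring", e);
      }
      try {
        Thread.sleep(checkIntervalMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return;
      }
    }
  }

  // Placeholder for the sketch: compute local scratch dir usage and signal
  // fast fail on the task attempt if it exceeds the configured limit.
  private void checkScratchDirSize() throws Exception {
  }
}
{code}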
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Priority: Major
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch,
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch,
> MAPREDUCE-7022.006.patch, MAPREDUCE-7022.007.patch, MAPREDUCE-7022.008.patch,
> MAPREDUCE-7022.009.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly run
> Hive based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300Gb and others where it was at 10Tb, both producing around 200Gb on
> local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the tasks should not be retried either; instead the job should fast fail.