[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe reassigned MAPREDUCE-7022:
-------------------------------------
Assignee: Johan Gustavsson
Thanks for the patch! Comments on the approach:
This is going to have some issues compared to the FileSystem counter approach.
The directories being checked by the patch are _not_ specific to an attempt.
If the intent was to cover just the container working directory then that's not
what this patch does. It is checking the application directories and thus is
checking the disk usage _across task attempts_ on a node's disk. If too many
task attempts run on a node then this threshold could be triggered, causing the
entire job to fail, when not all of the data originated from a single task
attempt. This isn't necessarily desirable, since the task getting retried on
another node could run just fine and the job would succeed.
It's pretty confusing to have both mapreduce.task.local-fs.limit.bytes and
mapreduce.task.local-fs.write-limit.bytes. Users are likely going to set both
of these to the same value since they won't understand the difference, or they
will only set one of them. As mentioned above this isn't really a per-task
thing as implemented in the patch, so maybe we should be using 'job' here
instead of 'task'. Also it's not clear from the description this is per-disk
and across tasks rather than per task and across disks as implemented by the
original write-limit.
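For illustration, the two settings being contrasted would sit side by side in mapred-site.xml like this (values are placeholders; only write-limit.bytes exists upstream today, the other key comes from this patch):

```xml
<!-- existing (MAPREDUCE-6489): per task attempt, summed across disks -->
<property>
  <name>mapreduce.task.local-fs.write-limit.bytes</name>
  <value>10737418240</value>
</property>
<!-- proposed by the patch: per disk, across task attempts on the node -->
<property>
  <name>mapreduce.task.local-fs.limit.bytes</name>
  <value>10737418240</value>
</property>
```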
This is going to add a disk I/O dependency to every task heartbeat where the
task attempt needs to touch every disk. There will be cases where this feature
will cause tasks to timeout/fail when they didn't before because this will
repeatedly touch bad/slow disks. I'm not sure if it makes more sense to have
this check less often in a background thread rather than inlined with the other
task reporting logic.
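A minimal sketch of the background-thread alternative, in plain Java with illustrative names (not the actual Hadoop classes): the heartbeat path only reads a cached value, so a bad or slow disk delays the refresh instead of blocking task status reports.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of moving the per-disk size check off the
// heartbeat: a single scheduled thread walks the scratch dirs and
// publishes the total; the reporter thread reads it lock-free.
public class ScratchDirMonitor {
  private final File[] scratchDirs;          // one entry per local disk
  private volatile long lastMeasuredBytes;   // read without disk I/O
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public ScratchDirMonitor(File[] scratchDirs, long intervalSecs) {
    this.scratchDirs = scratchDirs;
    scheduler.scheduleWithFixedDelay(
        () -> lastMeasuredBytes = measure(), 0, intervalSecs, TimeUnit.SECONDS);
  }

  // Called from the heartbeat / reporting path: no disk access here.
  public long getScratchBytes() { return lastMeasuredBytes; }

  private long measure() {
    long total = 0;
    for (File dir : scratchDirs) {
      total += du(dir);
    }
    return total;
  }

  // Recursive disk-usage walk, roughly what FileUtil.getDU does.
  static long du(File f) {
    if (!f.exists()) return 0;
    if (f.isFile()) return f.length();
    long sum = 0;
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) sum += du(c);
    }
    return sum;
  }

  public void shutdown() { scheduler.shutdownNow(); }
}
```

The trade-off is staleness: a task can overshoot the limit by up to one refresh interval's worth of writes before the cached value catches up.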
Comments on the code changes:
shouldFf does not seem appropriate on the generic TaskAttemptEvent. It's
really only a flag that makes sense on one very specific message type,
TA_FAILMSG. It would be cleaner if there was a derived TaskAttemptEvent class
for task attempt failure events where this would be an appropriate method.
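One possible shape for such a derived class, sketched here with stand-in types rather than the real Hadoop event hierarchy: the generic event stays untouched, and only the failure event carries the flag.

```java
// Hypothetical stand-in for the generic event class.
class TaskAttemptEvent {
  final String attemptId;
  final String type;
  TaskAttemptEvent(String attemptId, String type) {
    this.attemptId = attemptId;
    this.type = type;
  }
}

// Derived event used only for TA_FAILMSG, where a fast-fail flag
// actually makes sense; other event types never see the method.
class TaskAttemptFailEvent extends TaskAttemptEvent {
  private final boolean fastFail;

  TaskAttemptFailEvent(String attemptId, boolean fastFail) {
    super(attemptId, "TA_FAILMSG");
    this.fastFail = fastFail;
  }

  boolean isFastFail() { return fastFail; }
}
```

Transition code handling TA_FAILMSG can then downcast to the derived type and read the flag, instead of every TaskAttemptEvent exposing a flag that is meaningless for most message types.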
I don't see why failFast needs to be persisted as state in the task. There's
only one transition that needs it, FailedTransition, and the code can convey
the event boolean through to notifyTaskAttemptFailed as an argument to that
method rather than poking into state and having that method pick it up from
there.
Looking up a conf value that isn't changing is wasteful (hashmap lookup
followed by string marshalling). The code can cache the value in an initialize
method and use that cached value.
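A sketch of that caching pattern, using a plain Map as a stand-in for Configuration and an illustrative property name: the lookup and parse happen once in initialize, and the hot path only compares longs.

```java
import java.util.Map;

// Hypothetical sketch: resolve the configured limit once instead of
// doing a map lookup plus string parsing on every check.
class LimitChecker {
  private long limitBytes = -1L;  // cached; -1 means "no limit set"

  void initialize(Map<String, String> conf) {
    // illustrative key, mirroring the property discussed in the review
    String v = conf.get("mapreduce.job.local-fs.limit.bytes");
    limitBytes = (v == null) ? -1L : Long.parseLong(v);
  }

  boolean overLimit(long usedBytes) {
    return limitBytes >= 0 && usedBytes > limitBytes;
  }
}
```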
FileUtil.getDU should be used instead of FileUtils.sizeOfDirectory so we don't
have to add a commons IO dependency to mapreduce-client-core.
TaskAttemptListenerImpl#fatalErrorFF needs an @Override annotation.
Nit: Does it make sense to track the largest size when any size above the limit
is going to make it fail? Likely there's only one at the point where it's
going to fail since it's constantly checking, so I'm not seeing the value of
tracking the largest here. Not a must-fix, just curious why it would be
necessary.
Nit: The 'FF' acronym could use a more descriptive name since many won't know
what 'FF' stands for initially. Would it make more sense to call it something
like TaskFailed (e.g.: fatalErrorTaskFailed) to convey this is a fatal error in
an attempt that causes the entire task to fail?
There are a lot of unnecessary whitespace and reformatting changes in
TaskAttemptListenerImpl in areas unrelated to what needs to be changed. That
makes the patch larger than it needs to be and more difficult to backport.
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on their writes to local disk. In our environment, where we mainly
> run Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300GB and others where it was at 10TB, both producing around 200GB on
> local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the task should not be retried either; instead the job should fast fail.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]