[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe reassigned MAPREDUCE-7022:
-------------------------------------
Assignee: Johan Gustavsson
Thanks for the patch! Comments on the approach:
This is going to have some issues compared to the FileSystem counter approach.
The directories being checked by the patch are _not_ specific to an attempt.
If the intent was to cover just the container working directory then that's not
what this patch does. It is checking the application directories and thus is
checking the disk usage _across task attempts_ on a node's disk. If too many
task attempts run on a node then this threshold could be triggered, causing the
entire job to fail, when not all of the data originated from a single task
attempt. This isn't necessarily desirable, since the task getting retried on
another node could run just fine and the job would succeed.
It's pretty confusing to have both mapreduce.task.local-fs.limit.bytes and
mapreduce.task.local-fs.write-limit.bytes. Users are likely going to set both
of these to the same value since they won't understand the difference, or they
will only set one of them. As mentioned above this isn't really a per-task
thing as implemented in the patch, so maybe we should be using 'job' here
instead of 'task'. Also it's not clear from the description this is per-disk
and across tasks rather than per task and across disks as implemented by the
original write-limit.
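For illustration, the two settings being contrasted would sit side by side in mapred-site.xml like this (values are placeholders; only write-limit.bytes exists upstream today, the other key comes from this patch):

```xml
<!-- existing (MAPREDUCE-6489): per task attempt, summed across disks -->
<property>
  <name>mapreduce.task.local-fs.write-limit.bytes</name>
  <value>10737418240</value>
</property>
<!-- proposed by the patch: per disk, across task attempts on the node -->
<property>
  <name>mapreduce.task.local-fs.limit.bytes</name>
  <value>10737418240</value>
</property>
```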
This is going to add a disk I/O dependency to every task heartbeat where the
task attempt needs to touch every disk. There will be cases where this feature
will cause tasks to timeout/fail when they didn't before because this will
repeatedly touch bad/slow disks. I'm not sure if it makes more sense to have
this check less often in a background thread rather than inlined with the other
task reporting logic.
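A minimal sketch of the background-thread alternative, in plain Java with illustrative names (not the actual Hadoop classes): the heartbeat path only reads a cached value, so a bad or slow disk delays the refresh instead of blocking task status reports.

```java
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of moving the per-disk size check off the
// heartbeat: a single scheduled thread walks the scratch dirs and
// publishes the total; the reporter thread reads it lock-free.
public class ScratchDirMonitor {
  private final File[] scratchDirs;          // one entry per local disk
  private volatile long lastMeasuredBytes;   // read without disk I/O
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public ScratchDirMonitor(File[] scratchDirs, long intervalSecs) {
    this.scratchDirs = scratchDirs;
    scheduler.scheduleWithFixedDelay(
        () -> lastMeasuredBytes = measure(), 0, intervalSecs, TimeUnit.SECONDS);
  }

  // Called from the heartbeat / reporting path: no disk access here.
  public long getScratchBytes() { return lastMeasuredBytes; }

  private long measure() {
    long total = 0;
    for (File dir : scratchDirs) {
      total += du(dir);
    }
    return total;
  }

  // Recursive disk-usage walk, roughly what FileUtil.getDU does.
  static long du(File f) {
    if (!f.exists()) return 0;
    if (f.isFile()) return f.length();
    long sum = 0;
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) sum += du(c);
    }
    return sum;
  }

  public void shutdown() { scheduler.shutdownNow(); }
}
```

The trade-off is staleness: a task can overshoot the limit by up to one refresh interval's worth of writes before the cached value catches up.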
Comments on the code changes:
shouldFf does not seem appropriate on the generic TaskAttemptEvent. It's
really only a flag that makes sense on one very specific message type,
TA_FAILMSG. It would be cleaner if there was a derived TaskAttemptEvent class
for task attempt failure events where this would be an appropriate method.
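One possible shape for such a derived class, sketched here with stand-in types rather than the real Hadoop event hierarchy: the generic event stays untouched, and only the failure event carries the flag.

```java
// Hypothetical stand-in for the generic event class.
class TaskAttemptEvent {
  final String attemptId;
  final String type;
  TaskAttemptEvent(String attemptId, String type) {
    this.attemptId = attemptId;
    this.type = type;
  }
}

// Derived event used only for TA_FAILMSG, where a fast-fail flag
// actually makes sense; other event types never see the method.
class TaskAttemptFailEvent extends TaskAttemptEvent {
  private final boolean fastFail;

  TaskAttemptFailEvent(String attemptId, boolean fastFail) {
    super(attemptId, "TA_FAILMSG");
    this.fastFail = fastFail;
  }

  boolean isFastFail() { return fastFail; }
}
```

Transition code handling TA_FAILMSG can then downcast to the derived type and read the flag, instead of every TaskAttemptEvent exposing a flag that is meaningless for most message types.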
I don't see why failFast needs to be persisted as state in the task. There's
only one transition that needs it, FailedTransition, and the code can convey
the event boolean through to notifyTaskAttemptFailed as an argument to that
method rather than poking into state and having that method pick it up from
there.
Looking up a conf value that isn't changing is wasteful (hashmap lookup
followed by string marshalling). The code can cache the value in an initialize
method and use that cached value.
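A sketch of that caching pattern, using a plain Map as a stand-in for Configuration and an illustrative property name: the lookup and parse happen once in initialize, and the hot path only compares longs.

```java
import java.util.Map;

// Hypothetical sketch: resolve the configured limit once instead of
// doing a map lookup plus string parsing on every check.
class LimitChecker {
  private long limitBytes = -1L;  // cached; -1 means "no limit set"

  void initialize(Map<String, String> conf) {
    // illustrative key, mirroring the property discussed in the review
    String v = conf.get("mapreduce.job.local-fs.limit.bytes");
    limitBytes = (v == null) ? -1L : Long.parseLong(v);
  }

  boolean overLimit(long usedBytes) {
    return limitBytes >= 0 && usedBytes > limitBytes;
  }
}
```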
FileUtil.getDU should be used instead of FileUtils.sizeOfDirectory so we don't
have to add a commons IO dependency to mapreduce-client-core.
TaskAttemptListenerImpl#fatalErrorFF needs an @Override annotation.
Nit: Does it make sense to track the largest size when any size above the limit
is going to make it fail? Likely there's only one at the point where it's
going to fail since it's constantly checking, so I'm not seeing the value of
tracking the largest here. Not a must-fix, just curious why it would be
necessary.
Nit: The 'FF' acronym could use a more descriptive name since many won't know
what 'FF' stands for initially. Would it make more sense to call it something
like TaskFailed (e.g.: fatalErrorTaskFailed) to convey this is a fatal error in
an attempt that causes the entire task to fail?
There are a lot of unnecessary whitespace and reformatting changes in
TaskAttemptListenerImpl in areas unrelated to what needs to be changed. That
makes the patch larger than it needs to be and more difficult to backport.
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on their writes to local disk. In our environment, where we mainly
> run Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300GB and others where it was at 10TB, both producing around 200GB on
> local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the task should not be retried either; instead the job should fast fail.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]