[ https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe reassigned MAPREDUCE-7022:
-------------------------------------
Assignee: Johan Gustavsson (was: Jason Lowe)
Thanks for updating the patch!
The common code between the listener's fatalError and fatalErrorFailFast should
be factored out; otherwise someone is going to come along and update one
without updating the other. Right now they are almost complete copies of each
other.
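A minimal sketch of the kind of refactoring meant here. ListenerSketch, the
reported list, and the private reportFatal helper are hypothetical stand-ins
and not the actual listener code or its signatures; only the two method names
come from the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the listener; not the actual Hadoop class.
class ListenerSketch {
    // Records what would be reported so the sketch runs without Hadoop.
    final List<String> reported = new ArrayList<>();

    public void fatalError(String attemptId, String message) {
        reportFatal(attemptId, message, false);
    }

    public void fatalErrorFailFast(String attemptId, String message) {
        reportFatal(attemptId, message, true);
    }

    // Single shared implementation; the two entry points differ only in the flag.
    private void reportFatal(String attemptId, String message, boolean failFast) {
        String kind = failFast ? "FATAL_FAIL_FAST" : "FATAL";
        reported.add(kind + " " + attemptId + ": " + message);
    }
}
```

With this shape, a later change to how fatal errors are reported only has to be
made in one place.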
There are many places in the fail-fast code that refer to "job" when it
really should be "task". A failing task does not necessarily mean the job
fails. I think it would be clearer if FastFail and FailFast are replaced
with FailTask in method names and fields.
It looks like TestTaskImpl is sending T_ATTEMPT_FAILED messages without them
being the proper event type, so event casting in the task transition will fail.
The new doc change in TaskUmbilicalProtocol refers to failing the job, but it
actually fails the task.
Nit: I think it would be cleaner if confs were rooted at
mapreduce.job.local-fs.single-disk-limit, e.g.:
mapreduce.job.local-fs.single-disk-limit.bytes.
The default value for the boolean kill setting has a comment stating that
negative values disable the limit, which does not make sense for a boolean.
The disk checker should always log rather than only logging when it is not
killing. That way important info relevant to the task attempt is logged
whether the task is killed or not. It should arguably be logged as a WARN if
not killing the task and FATAL if we do.
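A rough sketch of the logging behavior suggested above. DiskLimitCheckSketch
and its method names and parameters are hypothetical, and treating a negative
limit as "disabled" is an assumption. java.util.logging is used only to keep
the example self-contained; it has no FATAL level, so SEVERE stands in for it.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; not the disk checker from the patch.
class DiskLimitCheckSketch {
    private static final Logger LOG = Logger.getLogger("DiskLimitCheckSketch");

    // WARN when only reporting, FATAL (SEVERE here) when the task will be killed.
    static Level levelFor(boolean killOnExceed) {
        return killOnExceed ? Level.SEVERE : Level.WARNING;
    }

    // Returns true when the task attempt should be killed.
    static boolean check(String attemptId, long usedBytes, long limitBytes,
                         boolean killOnExceed) {
        // Assumption: a negative limit disables the check.
        if (limitBytes < 0 || usedBytes <= limitBytes) {
            return false;
        }
        // Always log once the limit is exceeded, whether or not we kill.
        LOG.log(levelFor(killOnExceed),
                "Attempt " + attemptId + " uses " + usedBytes
                + " bytes of local scratch space, limit is " + limitBytes
                + (killOnExceed ? "; killing task" : "; not killing task"));
        return killOnExceed;
    }
}
```

The point is that the log call sits before the kill decision is returned, so
the usage information is recorded on both paths.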
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch,
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch,
> MAPREDUCE-7022.006.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly
> run Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300 GB and others where it was at 10 TB, both producing around 200 GB
> on local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the tasks should not be retried either; instead the job should fail fast.