[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312311#comment-16312311
]
Johan Gustavsson commented on MAPREDUCE-7022:
---------------------------------------------
[~jlowe] Sorry for the holdup on this due to the holidays, but I believe I have
taken most of your points into account in this new version of the patch. While
failFast is still persisted in state, that state is now kept in the
FailedTransition class instead of its parent class, TaskAttemptImpl. The reason
for keeping it in state, rather than passing it as an argument to the
transition, is how large the refactoring would have to be, since the transition
is triggered by a TA_CLEANUP_DONE event. Also, as far as I can tell, the
failing tests seem to be unrelated to this patch. Please let me know if you
have any further concerns or points.
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch,
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch,
> MAPREDUCE-7022.006.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly
> run Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN counter
> was at 300GB and others where it was at 10TB, both producing around 200GB on
> local disk, so it didn't help us much. To extend this feature, tasks should
> monitor local scratch dir size and fail if they pass the limit. In these cases
> the tasks should not be retried either; instead the job should fast fail.
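
To illustrate the check described above, here is a minimal sketch of measuring
a scratch directory and comparing it against a limit. It is not taken from any
of the attached patches: the class name ScratchDirLimit, the method names, and
the convention that a negative limit disables the check are all assumptions
made for this example. How the result would be wired into the task attempt
state machine (for example, to mark the failure as fast fail) is left out.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper for illustration only; not part of the MAPREDUCE-7022 patch.
final class ScratchDirLimit {

    private ScratchDirLimit() {}

    /** Sums the sizes, in bytes, of all regular files under root (symlinks are not followed). */
    static long sizeOf(Path root) throws IOException {
        final long[] total = {0L};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    total[0] += attrs.size();
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Scratch files can disappear while the task is still running; skip them.
                return FileVisitResult.CONTINUE;
            }
        });
        return total[0];
    }

    /**
     * Returns true if the directory size is strictly greater than limitBytes.
     * Assumption for this sketch: a negative limit means the check is disabled.
     */
    static boolean exceedsLimit(Path root, long limitBytes) throws IOException {
        if (limitBytes < 0) {
            return false;
        }
        return sizeOf(root) > limitBytes;
    }
}
```

A task could call exceedsLimit periodically on each of its local scratch dirs
and fail itself once it returns true. Unlike a BYTES_WRITTEN-style counter,
this measures what is actually on disk at that moment, so data that was
written and later deleted or overwritten does not count against the limit.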