[ https://issues.apache.org/jira/browse/MAPREDUCE-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639254#comment-16639254 ]

Wang Yan edited comment on MAPREDUCE-7148 at 10/5/18 4:55 AM:
--------------------------------------------------------------

[~ozawa] What do you think of the following mechanism to remove the compile-time dependency?

How about adding another configuration that lets users specify a list of exception 
FQCNs, so that whenever one of these exceptions occurs, the job fails fast instead 
of retrying the task over and over?

For example,

fast.fail.exceptions=org.apache.hadoop.hdfs.protocol.DSQuotaExceededException,...
fast.fail.on.designated.failure=true

(the property names still need to be renamed)

When YarnChild catches an exception from the task, perform a string comparison of 
the class names of all exceptions in the stack trace; if one of them is in the 
specified fast.fail.exceptions list, then fast fail the job.
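
To illustrate, here is a minimal, standalone sketch of that check (the class name 
FastFailChecker, its method names, and the sample values below are made up for this 
comment and are not an existing Hadoop/MapReduce API; reading the property value 
from the job Configuration is elided):

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FastFailChecker {

    private final Set<String> fastFailExceptions;

    public FastFailChecker(String commaSeparatedFqcns) {
        // Parse the comma-separated FQCN list taken from the proposed
        // fast.fail.exceptions property (value passed in directly here).
        this.fastFailExceptions =
            new HashSet<>(Arrays.asList(commaSeparatedFqcns.split("\\s*,\\s*")));
    }

    // Walk the cause chain of the caught exception and compare each
    // exception's class name as a string, so there is no compile-time
    // dependency on DSQuotaExceededException or any other listed class.
    public boolean shouldFastFail(Throwable caught) {
        for (Throwable t = caught; t != null; t = t.getCause()) {
            if (fastFailExceptions.contains(t.getClass().getName())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FastFailChecker checker = new FastFailChecker(
            "org.apache.hadoop.hdfs.protocol.DSQuotaExceededException,"
                + "java.io.FileNotFoundException");
        // Simulated task failure; FileNotFoundException stands in for a
        // listed exception so this demo has no HDFS dependency.
        Throwable taskFailure = new RuntimeException("task failed",
            new java.io.FileNotFoundException("simulated root cause"));
        System.out.println(checker.shouldFastFail(taskFailure)); // prints: true
    }
}
{code}

Because the comparison is purely on class-name strings, the MapReduce side would 
not need a compile-time dependency on hadoop-hdfs just to recognize 
DSQuotaExceededException.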


was (Author: tiana528):
[~ozawa] Or how about adding another configuration that lets users specify a list 
of exception FQCNs, so that whenever one of these exceptions occurs, the job fails 
fast instead of retrying the task over and over?

> Fast fail jobs when exceeds dfs quota limitation
> ------------------------------------------------
>
>                 Key: MAPREDUCE-7148
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7148
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>         Environment: hadoop 2.7.3
>            Reporter: Wang Yan
>            Priority: Major
>         Attachments: MAPREDUCE-7148.001.patch
>
>
> We are running Hive jobs with a DFS quota limitation per job (3 TB). If a job 
> hits the DFS quota limitation, the task that hit it fails, and there will be a 
> few task retries before the job actually fails. The retries are not very helpful 
> because the job will always fail anyway. In one of the worse cases, a job with a 
> single reduce task wrote more than 3 TB to HDFS over 20 hours; the reduce task 
> exceeded the quota limitation and retried 4 times until the job finally failed, 
> consuming a lot of unnecessary resources. This ticket aims at providing a 
> feature to let a job fail fast when it writes too much data to the DFS and 
> exceeds the DFS quota limitation. The fast-fail feature is introduced in 
> MAPREDUCE-7022 and MAPREDUCE-6489.


