[ https://issues.apache.org/jira/browse/MAPREDUCE-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668838#comment-16668838 ]

Jason Lowe commented on MAPREDUCE-7148:
---------------------------------------

Thanks for updating the patch!  Looks good overall, just some cleanup nits.

I'm personally not a fan of LimitedPrivate.  I think it has limited (ha!) 
utility in practice.  For example, what is Tez supposed to do if they want to 
implement the same feature?  Are they not allowed to do so until LimitedPrivate 
is removed from the class?  If so then we need to file a followup JIRA to 
remember to revisit this annotation.  I wonder if marking it Unstable is more 
useful than LimitedPrivate in practice, as a "buyer beware" for those who want 
to use it downstream and are willing to risk a future incompatibility in a 
later version of Hadoop.  Not a must-change, but I'm curious what 
[[email protected]] thinks about it, especially in the very likely scenario 
that Tez wants to replicate this feature.
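
To illustrate the alternative I'm describing, here is a minimal sketch.  The 
class names are made up and just stand in for whatever class the patch 
annotates; only the annotation types are the real Hadoop ones.

    import org.apache.hadoop.classification.InterfaceAudience;
    import org.apache.hadoop.classification.InterfaceStability;

    // What the patch effectively does today: only the named project(s) are
    // "allowed" to depend on this, so Tez is technically out of bounds.
    @InterfaceAudience.LimitedPrivate({"MapReduce"})
    class QuotaFastFailLimited extends RuntimeException {
    }

    // The alternative: open audience but explicitly unstable, i.e. "use it
    // downstream if you like, but it may change incompatibly in a later
    // release".
    @InterfaceAudience.Public
    @InterfaceStability.Unstable
    class QuotaFastFailUnstable extends RuntimeException {
    }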

reportError should not look up the conf value until it's necessary, i.e.: 
until the exception is known to be the relevant type.
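
Something along these lines is what I have in mind.  Sketch only: the conf key 
and the helper name are placeholders, not the identifiers in the patch.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.protocol.DSQuotaExceededException;

    class QuotaFastFailCheck {
      // Placeholder key name just for this sketch; the real key name is
      // whatever the patch defines.
      static final String FAST_FAIL_ON_QUOTA_KEY =
          "mapreduce.job.example.fast-fail-on-quota";

      static boolean shouldFastFail(Throwable cause, Configuration conf) {
        if (!(cause instanceof DSQuotaExceededException)) {
          return false;  // common path: no conf lookup at all
        }
        // Only consult the configuration once the type check has passed.
        return conf.getBoolean(FAST_FAIL_ON_QUOTA_KEY, false);
      }
    }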

The difference between WARN and ERROR for the log level is subtle, and WARN is 
arguably an odd choice here.  This exception is fatal to the task attempt, 
i.e.: the entity emitting the log message, so IMHO error is the bare minimum 
level.  I'd much rather see the log message state that it is requesting the 
job be terminated rather than expect users to notice whether it's a WARN vs. 
an ERROR to know the difference.
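
For example, something like this is the kind of message I mean.  The wording 
and variable names are illustrative only:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class FastFailLogging {
      private static final Logger LOG =
          LoggerFactory.getLogger(FastFailLogging.class);

      // Log at ERROR and say explicitly that we are asking for the whole job
      // to be killed, instead of relying on the reader to infer that from a
      // WARN vs. ERROR distinction.
      static void logFatalQuotaError(String taskAttemptId, Throwable cause) {
        LOG.error("Task attempt " + taskAttemptId + " hit "
            + cause.getClass().getSimpleName()
            + "; requesting the job be failed fast since retries cannot"
            + " succeed", cause);
      }
    }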

Nit: A lot of the tests are copy-n-paste.  A private method that takes the 
exception to throw and the expected value of the fail-job flag would help.
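
Roughly this shape; the names are hypothetical, and 
runTaskAndCaptureFailJobFlag stands in for whatever plumbing the copied test 
bodies currently share:

    // Each test case then becomes one call with the exception to throw and
    // the expected value of the fail-job flag.
    private void checkFastFailFlag(Exception toThrow, boolean expectedFailJob)
        throws Exception {
      boolean actual = runTaskAndCaptureFailJobFlag(toThrow);
      assertEquals("unexpected fail-job flag for "
          + toThrow.getClass().getSimpleName(), expectedFailJob, actual);
    }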


> Fast fail jobs when exceeds dfs quota limitation
> ------------------------------------------------
>
>                 Key: MAPREDUCE-7148
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7148
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>         Environment: hadoop 2.7.3
>            Reporter: Wang Yan
>            Assignee: Jason Lowe
>            Priority: Major
>         Attachments: MAPREDUCE-7148.001.patch, MAPREDUCE-7148.002.patch, 
> MAPREDUCE-7148.003.patch, MAPREDUCE-7148.004.patch, MAPREDUCE-7148.005.patch, 
> MAPREDUCE-7148.006.patch, MAPREDUCE-7148.007.patch, MAPREDUCE-7148.008.patch
>
>
> We are running hive jobs with a DFS quota limitation per job (3TB). If a job 
> hits the DFS quota limitation, the task that hit it will fail, and there will 
> be a few task retries before the job actually fails. The retries are not very 
> helpful because the job will always fail anyway. In one of the worse cases, we 
> had a job with a single reduce task writing more than 3TB to HDFS over 20 
> hours; the reduce task exceeded the quota limitation and retried 4 times 
> before the job finally failed, consuming a lot of unnecessary resources. This 
> ticket aims to provide a feature that lets a job fail fast when it writes too 
> much data to the DFS and exceeds the DFS quota limitation. The fast-fail 
> mechanism is introduced in MAPREDUCE-7022 and MAPREDUCE-6489.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

