[jira] [Assigned] (MAPREDUCE-7022) Fast fail rogue jobs based on task scratch dir size

Jason Lowe (JIRA) Fri, 12 Jan 2018 15:04:21 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe reassigned MAPREDUCE-7022:
-------------------------------------

    Assignee: Johan Gustavsson  (was: Jason Lowe)

Thanks for updating the patch!

The common code between the listener's fatalError and fatalErrorFailFast should 
be factored out; otherwise someone is going to come along and update one 
without updating the other.  Right now they are almost complete copies of each 
other.
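
As a rough sketch of the kind of refactoring I mean (only the two method 
names come from the patch; the class, the shared helper and the recorded 
strings are hypothetical stand-ins for the real listener and its events):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the listener. Only the names fatalError and
// fatalErrorFailFast come from the patch; everything else is illustrative.
class FatalErrorListenerSketch {
    // Records what would be sent to the event handler, in order.
    final List<String> sent = new ArrayList<>();

    // Fails the attempt; the usual retry policy still applies.
    void fatalError(String attemptId, String msg) {
        reportFatalError(attemptId, msg, false);
    }

    // Fails the attempt and asks that the task not be retried.
    void fatalErrorFailFast(String attemptId, String msg) {
        reportFatalError(attemptId, msg, true);
    }

    // Single shared body, so a future change cannot update one path only.
    private void reportFatalError(String attemptId, String msg, boolean noRetry) {
        sent.add("DIAGNOSTICS " + attemptId + ": " + msg);
        sent.add((noRetry ? "FAIL_NO_RETRY " : "FAIL ") + attemptId);
    }
}
```

The two public methods then differ only in the flag they pass, which makes 
the intended difference between them obvious to the next person who edits 
this code.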

There are many places in the fast-fail code that refer to "job" when they 
really should say "task".  A failing task does not necessarily mean the job 
fails.  I think it would be clearer if FastFail and FailFast are replaced 
with FailTask in method names and fields.

It looks like TestTaskImpl is sending T_ATTEMPT_FAILED messages without them 
being the proper event type, so event casting in the task transition will fail.
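
To illustrate the failure mode with a minimal sketch (the class names here 
are hypothetical, not the ones in the patch): a transition that casts to the 
more specific event class will throw if a test posts a plain base-class 
event carrying the T_ATTEMPT_FAILED type.

```java
// Hypothetical event classes; the real names in the patch may differ.
class AttemptFailedCastSketch {
    // Base event: carries only the event type name.
    static class TaskEvent {
        final String type;
        TaskEvent(String type) { this.type = type; }
    }

    // Specific event that a T_ATTEMPT_FAILED transition would expect.
    static class AttemptFailedEvent extends TaskEvent {
        final boolean failTask;
        AttemptFailedEvent(boolean failTask) {
            super("T_ATTEMPT_FAILED");
            this.failTask = failTask;
        }
    }

    // Mirrors a transition that assumes the subclass for T_ATTEMPT_FAILED.
    static boolean shouldFailTask(TaskEvent event) {
        // Throws ClassCastException if the caller sent the base class instead.
        AttemptFailedEvent failed = (AttemptFailedEvent) event;
        return failed.failTask;
    }
}
```

A test that constructs the base event with the right type name compiles 
fine but blows up at the cast, so the test needs to build the specific 
event class.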

The new doc change in TaskUmbilicalProtocol refers to failing the job, but it 
actually fails the task.

Nit: I think it would be cleaner if confs were rooted at 
mapreduce.job.local-fs.single-disk-limit, e.g.: 
mapreduce.job.local-fs.single-disk-limit.bytes.
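
Roughly what I have in mind, as a sketch only: the root and the .bytes key 
are the ones suggested above, while the other key names, the defaults and 
the use of java.util.Properties in place of the Hadoop Configuration are 
assumptions for illustration.

```java
import java.util.Properties;

// Sketch of property names sharing one root. Only PREFIX and BYTES follow
// the suggestion in this review; the remaining names are hypothetical.
class SingleDiskLimitConfSketch {
    static final String PREFIX = "mapreduce.job.local-fs.single-disk-limit";
    static final String BYTES = PREFIX + ".bytes";
    // Hypothetical sibling keys, shown only to illustrate the shared root.
    static final String CHECK_INTERVAL_MS = PREFIX + ".check.interval-ms";
    static final String CHECK_KILL = PREFIX + ".check.kill";

    // Assumption: a negative byte limit means the check is disabled.
    static final long DEFAULT_BYTES = -1L;
    // Assumption: killing is on by default; a boolean has no "negative" value.
    static final boolean DEFAULT_CHECK_KILL = true;

    // Reads the byte limit, falling back to the disabled default.
    static long limitBytes(Properties conf) {
        return Long.parseLong(
            conf.getProperty(BYTES, Long.toString(DEFAULT_BYTES)));
    }

    // Reads the kill flag, falling back to its boolean default.
    static boolean killOnExceed(Properties conf) {
        return Boolean.parseBoolean(
            conf.getProperty(CHECK_KILL, Boolean.toString(DEFAULT_CHECK_KILL)));
    }
}
```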

The default value for the boolean kill setting has a comment stating that 
negative values disable the limit, which does not make sense for a boolean.

The disk checker should always log rather than only logging when it is not 
killing.  That way important info relevant to the task attempt is logged 
whether the task is killed or not.  It should arguably be logged as a WARN if 
we are not killing the task and FATAL if we are.
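
A minimal sketch of that logging decision, under stated assumptions: the 
class, method and message text are hypothetical, a negative limit is taken 
to mean the check is disabled, and an enum stands in for the real logger 
levels.

```java
// Sketch of "always log when over the limit, pick the level by kill flag".
// All names here are hypothetical; an enum stands in for the real logger.
class DiskLimitCheckSketch {
    enum Level { NONE, WARN, FATAL }

    // What the checker would log and whether it would kill the attempt.
    static final class Outcome {
        final Level level;
        final String message;
        final boolean kill;

        Outcome(Level level, String message, boolean kill) {
            this.level = level;
            this.message = message;
            this.kill = kill;
        }
    }

    static Outcome check(String attemptId, long usedBytes, long limitBytes,
                         boolean killOnExceed) {
        // Assumption: a negative limit disables the check entirely.
        if (limitBytes < 0 || usedBytes <= limitBytes) {
            return new Outcome(Level.NONE, "", false);
        }
        // Build the message once so it is logged whether or not we kill.
        String message = "Task attempt " + attemptId + " is using " + usedBytes
            + " bytes on a single local disk, over the limit of "
            + limitBytes + " bytes";
        Level level = killOnExceed ? Level.FATAL : Level.WARN;
        return new Outcome(level, message, killOnExceed);
    }
}
```

The point is that the message is produced on both paths and only the level 
and the kill decision depend on the flag.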


> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-7022
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 2.7.0, 2.8.0, 2.9.0
>            Reporter: Johan Gustavsson
>            Assignee: Johan Gustavsson
>         Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch, 
> MAPREDUCE-7022.003.patch, MAPREDUCE-7022.004.patch, MAPREDUCE-7022.005.patch, 
> MAPREDUCE-7022.006.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue 
> tasks based on writes to local disk. In our environment, where we mainly 
> run Hive-based jobs, we noticed that this counter and the size of the local 
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN 
> counter was at 300 GB and others where it was at 10 TB, both producing 
> around 200 GB on local disk, so it didn't help us much. To extend this 
> feature, tasks should monitor local scratch dir size and fail if they pass 
> the limit. In these cases the tasks should not be retried either; instead 
> the job should fast fail.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

