[
https://issues.apache.org/jira/browse/MAPREDUCE-7022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290279#comment-16290279
]
Johan Gustavsson commented on MAPREDUCE-7022:
---------------------------------------------
Thanks for taking the time to review and give detailed feedback [~jlowe]
bq. It's pretty confusing to have both mapreduce.task.local-fs.limit.bytes and
mapreduce.task.local-fs.write-limit.bytes.
As you pointed out, this is not meant as a single-task monitor, but rather a
per-job single-disk usage monitor. Most likely most of the naming related to it
came subconsciously, since this patch was heavily inspired by MAPREDUCE-6489.
I'll rename it to something like mapreduce.job.single-disk.limit.bytes as you
suggested and make the description clearer.
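For reference, the renamed key would be set like any other job-level property in
mapred-site.xml (or per job on submission). The value shown is purely an
illustrative example, since neither the final name nor a default was settled in
this thread:

```xml
<!-- Hypothetical rename discussed above; name and value are examples only. -->
<property>
  <name>mapreduce.job.single-disk.limit.bytes</name>
  <!-- e.g. fail the job once any single local disk holds ~100 GB of scratch data -->
  <value>107374182400</value>
</property>
```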
bq. This is going to add a disk I/O dependency to every task heartbeat where
the task attempt needs to touch every disk.
Good point. I like your idea of putting it into a background thread so I'll try
to rewrite it accordingly.
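Roughly, the background-thread idea could look like the sketch below. To be
clear, this is my own illustrative sketch, not code from the patch: the class
name DiskLimitMonitor, the 10-second interval, and the simple recursive du are
all assumptions. The point is that the disk walking happens on its own thread,
so the heartbeat path only has to read a flag:

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only: monitors total size of local scratch dirs on a
// background thread so the heartbeat never touches disk.
public class DiskLimitMonitor implements Runnable {
    private final File[] localDirs;
    private final long limitBytes;
    private final AtomicBoolean limitExceeded = new AtomicBoolean(false);

    public DiskLimitMonitor(File[] localDirs, long limitBytes) {
        this.localDirs = localDirs;
        this.limitBytes = limitBytes;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                long total = 0;
                for (File dir : localDirs) {
                    total += du(dir);          // disk I/O happens here, off the heartbeat path
                }
                if (total > limitBytes) {
                    limitExceeded.set(true);   // heartbeat thread only reads this flag
                    return;
                }
                Thread.sleep(10_000L);         // check interval; purely illustrative
            }
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    /** Cheap check for the heartbeat thread: no disk I/O, just an atomic read. */
    public boolean isLimitExceeded() {
        return limitExceeded.get();
    }

    /** Recursively sum file sizes under dir. */
    static long du(File dir) {
        long size = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0;
        }
        for (File f : entries) {
            size += f.isDirectory() ? du(f) : f.length();
        }
        return size;
    }
}
```

The task would start this on a daemon thread and, when isLimitExceeded()
returns true, report the failure (with the fast-fail indication) on the next
heartbeat.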
bq. Comments on the code changes
Will try to fix them all. The main reason I introduced the FF key all over the
place was to avoid having to touch the actual state machine, but I think I see
your point on how to avoid doing both, and also how to clean it up. Also a good
point that most people probably don't know what "ff" (fast fail) stands for out
of context, so I'll make it less cryptic.
Thanks once again, I'll try to have something ready in the next couple of days.
> Fast fail rogue jobs based on task scratch dir size
> ---------------------------------------------------
>
> Key: MAPREDUCE-7022
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7022
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 2.7.0, 2.8.0, 2.9.0
> Reporter: Johan Gustavsson
> Assignee: Johan Gustavsson
> Attachments: MAPREDUCE-7022.001.patch, MAPREDUCE-7022.002.patch
>
>
> With the introduction of MAPREDUCE-6489 there are some options to kill rogue
> tasks based on writes to local disk. In our environment, where we mainly run
> Hive-based jobs, we noticed that this counter and the size of the local
> scratch dirs were very different. We had tasks where the BYTES_WRITTEN
> counter was at 300 GB and others where it was at 10 TB, both producing
> around 200 GB on local disk, so it didn't help us much. To extend this
> feature, tasks should monitor local scratch dir size and fail if they pass
> the limit. In these cases the tasks should not be retried either; instead
> the job should fast fail.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)