[ https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782937#comment-16782937 ]

Wilfred Spiegelenburg commented on MAPREDUCE-7180:
--------------------------------------------------

The 80/20 case, as Daniel said, will not work for all cases, but it handles 
almost all use cases. The headroom ratio is configurable, which means that if 
you know you have a high overhead due to the type of code you run, you can set 
it cluster wide. I would be in favour of not wasting resources and failing the 
application when the JVM goes OOM for one or more tasks. The re-run with 
adjusted settings has more drawbacks than advantages, I think.
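
To make the headroom knob concrete, here is a minimal sketch of the settings 
involved, assuming the mapreduce.map.memory.mb and 
mapreduce.job.heap.memory-mb.ratio properties (the latter defaulting to 0.8, 
the 80/20 split above). The values are illustrative only; a cluster-wide 
default would go in mapred-site.xml rather than in job code:

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class HeadroomExample {
  public static void main(String[] args) throws IOException {
    // Sketch only: illustrative values, not a recommendation.
    Configuration conf = new Configuration();

    // Container size requested from YARN for each map task (MB).
    conf.setInt("mapreduce.map.memory.mb", 2048);

    // Heap-to-container ratio used when mapreduce.map.java.opts does not
    // set -Xmx explicitly; 0.8 is the default "80/20" split. A workload
    // with heavy native/off-heap usage could lower it, e.g. to 0.7, or set
    // the same key in mapred-site.xml to apply it cluster wide.
    conf.setFloat("mapreduce.job.heap.memory-mb.ratio", 0.7f);

    // Derived heap would be roughly 2048 MB * 0.7 = ~1433 MB, leaving about
    // 600 MB of headroom for native allocations, metaspace, and so on.
    Job job = Job.getInstance(conf, "headroom-example");
    System.out.println("map container MB = "
        + job.getConfiguration().getInt("mapreduce.map.memory.mb", -1));
  }
}
{code}

The same keys can also be overridden per job, which is the manual tuning path 
the quoted description below refers to.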

The main reason I am not in favour of the auto retries is that it hides 
possible issues without providing a guarantee that it will work. There is a 
good chance that when one mapper or reducer fails due to memory issues, more 
mappers or reducers will fail in the same way. Multiple tasks failing 
increases the overhead on the cluster, as Jim mentioned in his example. With 
data growing or small code changes over time, either in the app, the MR 
framework or the JVM, you could be putting a lot of extra strain on the 
cluster. 

What if the application still fails due to task failures: how do we handle an 
application re-run? Won't that start from scratch again and thus waste even 
more resources?  


> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common that a MR job completely fails because a 
> single Mapper/Reducer container is using more memory than has been reserved 
> in YARN.  The following message is logged in the MapReduce 
> ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] 
> is running beyond physical memory limits. 
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual 
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it 
> is killed again for the same reason.  This process happens three (maybe 
> four?) times before the entire MapReduce job fails.  It's often said that the 
> definition of insanity is doing the same thing over and over and expecting 
> different results.
> For all intents and purposes, the amount of resources requested by Mappers 
> and Reducers is effectively fixed, based on the default configuration 
> values.  Users can set the memory on a per-job basis, but it's a pain, not 
> exact, and requires intimate knowledge of the MapReduce framework and its 
> memory usage patterns.
> I propose that when the MR ApplicationMaster detects that a container was 
> killed because of this specific memory resource constraint, it should 
> request a larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the 
> container fails and the task is retried.  This would prevent many Job 
> failures and still allow per-Job memory tuning after the fact to get better 
> performance (vs. simply failing or succeeding).
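
For concreteness, the escalation proposed above would amount to roughly the 
sketch below. The class and method names are hypothetical (this is not 
existing MRAppMaster code), and a real implementation would also have to cap 
the request at yarn.scheduler.maximum-allocation-mb:

{code:java}
// Hypothetical sketch of the proposed escalation; not existing MR code.
public final class MemoryEscalation {

  /** Factor by which to grow the container request after a pmem kill. */
  private static final double GROWTH_FACTOR = 1.5;

  /**
   * Memory (MB) to request for the next task attempt after the previous
   * attempt was killed for exceeding its physical memory limit.
   *
   * @param previousMb      memory of the attempt that was killed
   * @param schedulerMaxMb  cluster cap, e.g. yarn.scheduler.maximum-allocation-mb
   */
  static int nextAttemptMemoryMb(int previousMb, int schedulerMaxMb) {
    long grown = (long) Math.ceil(previousMb * GROWTH_FACTOR);
    return (int) Math.min(grown, schedulerMaxMb);
  }

  public static void main(String[] args) {
    // A 1 GB container killed for pmem overuse, with an 8 GB cluster cap:
    // successive attempts would ask for 1024 -> 1536 -> 2304 -> 3456 MB.
    int mb = 1024;
    for (int attempt = 1; attempt <= 4; attempt++) {
      System.out.println("attempt " + attempt + ": " + mb + " MB");
      mb = nextAttemptMemoryMb(mb, 8192);
    }
  }
}
{code}

With a 1.5x factor the request grows geometrically across retries, which is 
part of the cluster-overhead concern raised in the comment above.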


