[
https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760412#comment-16760412
]
Wilfred Spiegelenburg commented on MAPREDUCE-7180:
--------------------------------------------------
I also have some reservations about just growing the container on a failure.
Letting the application fail is the best way to get the job reviewed and
configured correctly. For a properly configured job we should see the GC kick
in well before we run over the size of the container. If your default settings
do not take care of that, you are not managing the cluster correctly.
In MAPREDUCE-5785 we introduced the automatic calculation of the heap size
based on the container size and vice versa. If you use that control, you should
never get into this situation. What happens when the application relies on that
calculation for the heap and/or container size and still fails? How are you
going to handle that case if the container fails with the same message? Are you
also going to change the configured heap-to-container ratio? That case could be
caused by the mapper or reducer using more off-heap memory (a 3rd-party
library). How is that going to work with this automatic re-run?
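For reference, a minimal sketch of relying on that auto-calculation (the
property values are purely illustrative): request only the container sizes,
leave -Xmx out of the Java opts, and let the ratio derive the heap.
{code}
# Container sizes requested from YARN (illustrative values)
mapreduce.map.memory.mb=2048
mapreduce.reduce.memory.mb=4096
# With no explicit -Xmx in mapreduce.{map,reduce}.java.opts the heap is derived
# as ratio * container size; the default ratio of 0.8 gives roughly a 1.6GB map
# heap and a 3.2GB reduce heap here.
mapreduce.job.heap.memory-mb.ratio=0.8
{code}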
Another point to consider is that I can always run over the container by
setting an overly large heap. As an example: I know my job can run in a 1GB
heap because I have tried it. I now set a 10GB heap as a test. GCs will not
kick in because the heap is never really full, so usage just keeps growing way
above the 1GB. If I configure the job to run in a 2GB container, the overly
large heap will cause it to fail. It might even fail when I make the container
4GB or 8GB. Just doubling and re-running is going to be problematic. Using the
available configuration and the smarts that are built in is a far better
solution.
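Concretely, the mismatch I am describing looks like this (again, illustrative
values only):
{code}
# A 2GB container paired with a 10GB max heap: YARN kills the task for
# exceeding physical memory long before a full GC is ever forced.
mapreduce.map.memory.mb=2048
mapreduce.map.java.opts=-Xmx10g
{code}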
> Relaunching Failed Containers
> -----------------------------
>
> Key: MAPREDUCE-7180
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: mrv1, mrv2
> Reporter: BELUGA BEHR
> Priority: Major
>
> In my experience, it is very common that a MR job completely fails because a
> single Mapper/Reducer container is using more memory than has been reserved
> in YARN. The following message is logged by the MapReduce
> ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666]
> is running beyond physical memory limits.
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it
> is killed again for the same reason. This process happens three (maybe
> four?) times before the entire MapReduce job fails. It's often said that the
> definition of insanity is doing the same thing over and over and expecting
> different results.
> For all intents and purposes, the amount of resources requested by Mappers
> and Reducers is a fixed amount, based on the default configuration values.
> Users can set the memory on a per-job basis, but it's a pain, not exact, and
> requires intimate knowledge of the MapReduce framework and its memory usage
> patterns.
> I propose that if the MR ApplicationMaster detects that a container is killed
> because of this specific memory resource constraint, it requests a
> larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the
> container fails and the task is retried. This will prevent many Job failures
> and allow for additional memory tuning, per-Job, after the fact, to get
> better performance (vs. fail/succeed).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]