[ https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761403#comment-16761403 ]

Wilfred Spiegelenburg commented on MAPREDUCE-7180:
--------------------------------------------------

When you use MAPREDUCE-5785 you should not see the type of failures that you 
are trying to prevent. The heap and its overhead should always fit in the 
container unless you have some special off-heap case. You should thus only 
expect to see these failures for third-party library and/or off-heap issues. 
What you are trying to implement is really only relevant for edge cases like 
the misconfiguration case, which you state is not really the goal, since the 
job will still fail anyway.

That is why I think adding all this to hide a misconfiguration is the wrong 
thing to do. 
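
For reference, with MAPREDUCE-5785 in place the task heap (-Xmx) is derived 
from the container size, which is what keeps a plain-Java task inside its 
limit by construction. A minimal sketch of that configuration, assuming the 
default heap ratio that change introduced (the 2048 value is illustrative):

{code}
<!-- Container size for map tasks; the JVM heap is derived from it -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<!-- Fraction of container memory used for -Xmx when
     mapreduce.map.java.opts does not set it (MAPREDUCE-5785) -->
<property>
  <name>mapreduce.job.heap.memory-mb.ratio</name>
  <value>0.8</value>
</property>
{code}

With these values the JVM gets roughly a 1638 MB heap inside a 2048 MB 
container, leaving headroom for JVM overhead.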

> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common for an MR job to fail completely because 
> a single Mapper/Reducer container uses more memory than has been reserved in 
> YARN.  The following message is logged by the MapReduce ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] 
> is running beyond physical memory limits. 
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual 
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it 
> is killed again for the same reason.  This process happens three (maybe 
> four?) times before the entire MapReduce job fails.  It's often said that the 
> definition of insanity is doing the same thing over and over and expecting 
> different results.
> For all intents and purposes, the amount of memory requested by Mappers and 
> Reducers is a fixed amount, based on the default configuration values.  
> Users can set the memory on a per-job basis, but it is a pain, not exact, 
> and requires intimate knowledge of the MapReduce framework and its memory 
> usage patterns (a concrete example of such an override follows this 
> description).
> I propose that when the MR ApplicationMaster detects that a container was 
> killed because of this specific memory resource constraint, it request a 
> larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the 
> container fails and the task is retried (a sketch follows this 
> description).  This would prevent many job failures and still allow 
> additional per-job memory tuning, after the fact, for better performance 
> (vs. simply failing or succeeding).
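
For context, the per-job override the description calls painful is a handful 
of properties. A minimal sketch, assuming the job's main class uses ToolRunner 
so that -D options are picked up (the jar, class, and path names are 
placeholders):

{code}
hadoop jar wordcount.jar WordCount \
  -Dmapreduce.map.memory.mb=2048 \
  -Dmapreduce.map.java.opts=-Xmx1638m \
  /input /output
{code}

The pain point is exactly the one stated: picking 2048 and 1638 requires 
knowing the task's actual memory profile in advance.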
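
And a minimal sketch of the proposed 50% escalation, as illustrative code 
rather than existing ApplicationMaster logic (the class name, field names, 
and the cap at the cluster maximum are assumptions; the 1.5 growth factor is 
from the description):

{code}
// Illustrative only: grow the container request by 50% per failed
// attempt, capped at the cluster maximum allocation.
public final class EscalatingMemoryPolicy {

    private final int baseMb; // container size of the first attempt
    private final int maxMb;  // e.g. yarn.scheduler.maximum-allocation-mb

    public EscalatingMemoryPolicy(int baseMb, int maxMb) {
        this.baseMb = baseMb;
        this.maxMb = maxMb;
    }

    /** Memory (MB) to request for the given zero-based task attempt. */
    public int memoryForAttempt(int attempt) {
        double requested = baseMb * Math.pow(1.5, attempt);
        return (int) Math.min(requested, (double) maxMb);
    }
}
{code}

With baseMb = 1024 and maxMb = 8192, successive attempts would request 1024, 
1536, 2304, and 3456 MB.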


