[
https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16784121#comment-16784121
]
Wilfred Spiegelenburg commented on MAPREDUCE-7180:
--------------------------------------------------
What I meant is that if an application fails, we re-run the application. Any
finished tasks are fine: they are recovered. Running tasks are killed and
restarted. If those tasks had already failed one or more times in the first
application attempt and we had relaunched them with larger heaps, we would start
the process of increasing the containers again from scratch, wasting more
resources.
I think what Daniel proposed is the simplest and most elegant solution. If a
task fails because it exceeds its container, we should fail the application
and let the end user and/or admin sort it out. Even for an Oozie workflow, or
in the Hive case when running jobs through beeline, you can set the size of the
container and related settings via the command line.
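For example (assuming Hive runs on the MapReduce execution engine and the MR
driver parses generic options via ToolRunner; the jar and class names below are
made up), something along these lines already works today:
{code}
-- beeline / Hive CLI, per session or per statement:
set mapreduce.map.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;

# plain MapReduce job submission:
hadoop jar my-app.jar com.example.MyJob \
    -Dmapreduce.map.memory.mb=4096 \
    -Dmapreduce.map.java.opts=-Xmx3276m \
    <input> <output>
{code}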
I think finding the cause is not that difficult, but as part of the change to
fail the application we could make it really clear in the application
diagnostics what failed and which action to take. The message for a container
exceeding its limits has already been extended via YARN-7580 and should be
clearer in 3.1 and later.
> Relaunching Failed Containers
> -----------------------------
>
> Key: MAPREDUCE-7180
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: mrv1, mrv2
> Reporter: David Mollitor
> Priority: Major
>
> In my experience, it is very common for an MR job to fail completely because a
> single Mapper/Reducer container is using more memory than has been reserved
> in YARN. The following message is logged by the MapReduce
> ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666]
> is running beyond physical memory limits.
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it
> is killed again for the same reason. This process happens three (maybe
> four?) times before the entire MapReduce job fails. It's often said that the
> definition of insanity is doing the same thing over and over and expecting
> different results.
> For all intents and purposes, the amount of resources requested by Mappers
> and Reducers is a fixed amount, based on the default configuration values.
> Users can set the memory on a per-job basis, but it's a pain, not exact, and
> requires intimate knowledge of the MapReduce framework and its memory usage
> patterns.
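> As a sketch of what that per-job tuning looks like with the current client API
> (the values here are made up; the point is that the container size and the
> child JVM heap have to be kept in sync by hand):
> {code}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
>
> public class TuneJobMemory {
>   public static void main(String[] args) throws IOException {
>     Configuration conf = new Configuration();
>     // Container sizes requested from YARN, in MB; these have to be guessed up front.
>     conf.setInt("mapreduce.map.memory.mb", 4096);
>     conf.setInt("mapreduce.reduce.memory.mb", 8192);
>     // The child JVM heap must be kept below the container size by hand
>     // (roughly 80% of the container is a common rule of thumb).
>     conf.set("mapreduce.map.java.opts", "-Xmx3276m");
>     conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
>     Job job = Job.getInstance(conf, "memory-tuned-job");
>     // ... configure mapper/reducer/input/output as usual, then submit the job.
>   }
> }
> {code}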
> I propose that if the MR ApplicationMaster detects that a container was killed
> because of this specific memory resource constraint, it should request a
> larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the
> container fails and the task is retried. This would prevent many job failures
> and still allow for additional per-job memory tuning after the fact to get
> better performance (vs. fail/succeed).
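> A rough sketch of the proposed escalation (the helper below is hypothetical,
> not an existing MR ApplicationMaster API): multiply the original request by
> 1.5 for each prior failed attempt and cap it at the cluster's maximum
> container allocation.
> {code}
> public class EscalatingMemory {
>   /**
>    * Hypothetical helper, not part of the current MR ApplicationMaster:
>    * memory (in MB) to request for the given task attempt, growing the
>    * original request by 50% per prior failed attempt and capping it at
>    * the scheduler's maximum container allocation.
>    */
>   static int escalatedMemoryMb(int originalMb, int attempt, int maxAllocationMb) {
>     double mb = originalMb * Math.pow(1.5, attempt - 1); // attempt 1 = original size
>     return (int) Math.min(Math.ceil(mb), maxAllocationMb);
>   }
>   // Example: originalMb=1024 gives 1024, 1536, 2304 and 3456 MB for attempts 1-4.
> }
> {code}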