[ https://issues.apache.org/jira/browse/MAPREDUCE-7180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761052#comment-16761052 ]

BELUGA BEHR commented on MAPREDUCE-7180:
----------------------------------------

{quote}
Letting the application fail is the best way to get the job reviewed and 
configured correctly.
{quote}

Sure, perhaps, but no one wants to deal with a failure at 2 AM on a production 
system.  That also assumes the end user (for example, a business analyst using 
SQL through Hive) understands how to configure YARN containers.  Part of the 
goal is to make this simpler for users: if the configuration isn't perfect, 
resources and run time may not be optimal, but the job will work.

Because of [MAPREDUCE-5785], this issue only needs to worry about the container 
size for now.  Increasing the container size means that the JVM heap size and 
the overhead grow together (80/20).  If a user does something silly like 
manually setting -Xmx to 10 GB with a 1 GB container size, the job will still 
fail.  The job should also still adhere to the configured number of retries and 
not simply re-run until it succeeds (in some scenarios it obviously may never 
succeed).

Thanks!

> Relaunching Failed Containers
> -----------------------------
>
>                 Key: MAPREDUCE-7180
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7180
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv1, mrv2
>            Reporter: BELUGA BEHR
>            Priority: Major
>
> In my experience, it is very common that an MR job completely fails because a 
> single Mapper/Reducer container is using more memory than has been reserved 
> in YARN.  The following message is logged by the MapReduce 
> ApplicationMaster:
> {code}
> Container [pid=46028,containerID=container_e54_1435155934213_16721_01_003666] 
> is running beyond physical memory limits. 
> Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual 
> memory used. Killing container.
> {code}
> In this case, the container is re-launched on another node, and of course, it 
> is killed again for the same reason.  This process happens three (maybe 
> four?) times before the entire MapReduce job fails.  It's often said that the 
> definition of insanity is doing the same thing over and over and expecting 
> different results.
> For all intents and purposes, the amount of resources requested by Mappers 
> and Reducers is a fixed amount, based on the default configuration values.  
> Users can set the memory on a per-job basis, but it's a pain, not exact, and 
> requires intimate knowledge of the MapReduce framework and its memory usage 
> patterns.
> I propose that if the MR ApplicationMaster detects that a container was 
> killed because of this specific memory resource constraint, it should request 
> a larger container for the subsequent task attempt.
> For example, increase the requested memory size by 50% each time the 
> container fails and the task is retried (a rough sketch of this escalation 
> follows below the description).  This will prevent many job failures and 
> still allow additional per-job memory tuning after the fact for better 
> performance (vs. simply failing or succeeding).


