[jira] [Updated] (MAPREDUCE-3186) User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.

Eric Payne (Updated) (JIRA) Thu, 27 Oct 2011 11:26:56 -0700

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eric Payne updated MAPREDUCE-3186:
----------------------------------

    Attachment: MAPREDUCE-3186.v2.txt

I have addressed all of the following issues except for the unit tests. I 
wanted to get the new patch up here so you could look at it while I 
simultaneously address the unit tests.

Mahadev wrote:
> We should probably avoid reading the config entry every time we call get 
> resources. The maxretry can be initialized in the init() call.
Done. I initialized the variables in init().
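
For illustration only, here is a minimal sketch of the "read once in init()" 
pattern. It is not the code in the attached patch: the class name, the 
property key, and the default value are all made up, and a plain Map stands 
in for the real configuration object.

```java
import java.util.Map;

// Hypothetical sketch: cache a config-derived setting once in init()
// instead of looking it up on every call. Not the patch's actual code.
class CachedRetrySetting {
    // Made-up key and default, used only for this example.
    static final String RETRY_KEY = "example.am.rm.max-retries";
    static final int DEFAULT_MAX_RETRIES = 3;

    private int maxRetries = DEFAULT_MAX_RETRIES;

    // Called once at startup; the Map stands in for the job configuration.
    void init(Map<String, String> conf) {
        String raw = conf.get(RETRY_KEY);
        maxRetries = (raw == null) ? DEFAULT_MAX_RETRIES : Integer.parseInt(raw);
    }

    // Hot path: returns the cached field and never touches the config.
    int getMaxRetries() {
        return maxRetries;
    }
}
```

One consequence of this design is that a config change made after init() is 
not picked up until the component is initialized again.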

> Regarding the test, you should be able to mock a failure to communicate with 
> the RM and make sure that an internal error is generated. Also, check that 
> the MRApp shuts down on an internal error.
Still working on the unit tests.



Sid wrote:
> Can the timeout be changed from #failedAttempts to actual time? Currently the 
> MRAM sends a heartbeat at a specific interval (2s default). This may change 
> to ask for containers the moment a task needs to be launched.
Changed to a timed interval: the code keeps trying to contact the RM and 
errors out only once a certain amount of time has expired.
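
To make the time-based approach concrete, here is a rough sketch of one way 
it could work. This is an assumption-laden illustration, not the code in 
MAPREDUCE-3186.v2.txt: the class, the Heartbeat interface, and the choice of 
IllegalStateException are all hypothetical, and the clock is injected only so 
the behavior can be tested without sleeping.

```java
import java.util.function.LongSupplier;

// Hypothetical sketch of a time-based retry window for RM heartbeats.
// None of these names come from the actual patch.
class RmHeartbeatRetry {
    // Stand-in for a single heartbeat/allocate call to the RM.
    interface Heartbeat {
        void send() throws Exception;
    }

    private final long retryWindowMs;  // read once at init time
    private final LongSupplier clock;  // e.g. System::currentTimeMillis
    private long lastContactMs;        // time of the last successful contact

    RmHeartbeatRetry(long retryWindowMs, LongSupplier clock) {
        this.retryWindowMs = retryWindowMs;
        this.clock = clock;
        this.lastContactMs = clock.getAsLong();
    }

    // Returns true if the RM was reached, false if the call failed but the
    // retry window is still open. Throws once the RM has been unreachable
    // for at least retryWindowMs, regardless of how many attempts were made.
    boolean heartbeat(Heartbeat hb) {
        try {
            hb.send();
            lastContactMs = clock.getAsLong();
            return true;
        } catch (Exception e) {
            long silentMs = clock.getAsLong() - lastContactMs;
            if (silentMs >= retryWindowMs) {
                throw new IllegalStateException(
                    "Could not contact RM for " + silentMs + " ms", e);
            }
            return false;
        }
    }
}
```

Because the limit depends on elapsed time since the last successful contact 
rather than on a failure count, it should behave the same whether heartbeats 
go out every 2s or on demand.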

> The LocalContainerAllocator (used by uberized jobs) would need to be changed 
> as well.
I have changed the LocalContainerAllocator code as well. Good catch.

                
> User jobs are getting hanged if the Resource manager process goes down and 
> comes up while job is getting executed.
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3186
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3186
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>         Environment: linux
>            Reporter: Ramgopal N
>            Assignee: Eric Payne
>            Priority: Blocker
>              Labels: test
>         Attachments: MAPREDUCE-3186.v1.txt, MAPREDUCE-3186.v2.txt
>
>
> If the resource manager is restarted while job execution is in progress, 
> the job hangs.
> UI shows the job as running.
> In the RM log, it is throwing an error "ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: 
> AppAttemptId doesnt exist in cache appattempt_1318579738195_0004_000001"
> On the console, the MRAppMaster and RunJar processes are not killed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
