[jira] [Updated] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

Jason Lowe (JIRA) Fri, 11 Apr 2014 15:19:24 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jason Lowe updated MAPREDUCE-5652:
----------------------------------

    Attachment: MAPREDUCE-5652-v5.patch

Thanks for the review, Karthik!

bq. How do we handle applications that finish while the NM is down?

The RM is already telling the NM about finished applications, and since it's 
not creating a new RMNodeImpl when it rejoins it should continue to tell it 
about subsequent finished applications since the last time the node heartbeated.

There may be some races in there where the RM thinks the NM processed the 
heartbeat response but the NM crashed before it completed processing it.  If 
that's an issue we could fix it by having the NM send the list of apps it 
thinks are still active when it re-registers, giving the RM an opportunity to 
correct it on the next heartbeat.  However that's outside of the scope (and 
project) of this JIRA.  I'll try to address that in YARN-1354 or possibly a 
separate YARN JIRA.

bq. Code related to initStateStore should ideally go into serviceInit(), 
primarily to future-proof against us supporting (re)starting stopped services.

serviceStart() indirectly calls initStateStore to open the database, and 
serviceStop() closes the database.  If we want later to support restarting 
stopped services then we need to continue to open the database in 
serviceStart().  Moving the initStateStore call to init rather than start means 
we will try to use a closed database after the service is restarted, or am I 
missing something?

bq. Use the constant JOB here?

Fixed.  I had to change its visibility to public in order to access it.

bq. ShuffleHandler#recordJobShuffleInfo: addJobToken() should come after 
attempt to include in the store?

Fixed.

bq. ShuffleHandler#close() should probably take care of clearing the static 
maps. Alternately, we could just make those maps non-static.

I made userRsrc and secretManager regular members rather than static.  
Originally I wasn't sure why they were static and didn't want to mess with that 
too much as part of this JIRA, but it's problematic for testing this.  
Apparently they always were static, but I couldn't find any reason for them to 
be so.  Even in light of multiple ShuffleHandler instances, I don't see why 
it's something we need (or necessarily even want) to share.  Manually ran the 
unit tests under hadoop-mapreduce-client-jobclient to verify there wasn't 
something fishy going on with the multi-instance ShuffleHandler in mini 
clusters or something like that.

bq. ShuffleHander#forgetJob() - should we make those two maps non-static?

Removed forgetJob() now that the members are not static.

bq. Do we need to change hadoop-mapreduce-project/pom.xml, given we already add 
the dependencies in the shuffle module?

Yes, if I remove the dependency from that pom.xml then the leveldb jar doesn't 
show up in the resulting dist tree under mapreduce/lib/.  Maybe may need the 
equivalent of YARN-888 for hadoop-mapreduce-project poms to only declare them 
in the leaf modules and still have them be picked up properly.

> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>
>                 Key: MAPREDUCE-5652
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Jason Lowe
>              Labels: shuffle
>         Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch, 
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running 
> map-tasks. On NM restart, the map outputs are cleaned up requiring 
> re-execution of map tasks and should be avoided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Updated] (MAPREDUCE-5652) NM Recovery. ShuffleHandler should handle NM restarts

Reply via email to