[
https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated MAPREDUCE-5652:
----------------------------------
Attachment: MAPREDUCE-5652-v5.patch
Thanks for the review, Karthik!
bq. How do we handle applications that finish while the NM is down?
The RM is already telling the NM about finished applications, and since it's
not creating a new RMNodeImpl when it rejoins it should continue to tell it
about subsequent finished applications since the last time the node heartbeated.
There may be some races in there where the RM thinks the NM processed the
heartbeat response but the NM crashed before it completed processing it. If
that's an issue we could fix it by having the NM send the list of apps it
thinks are still active when it re-registers, giving the RM an opportunity to
correct it on the next heartbeat. However that's outside of the scope (and
project) of this JIRA. I'll try to address that in YARN-1354 or possibly a
separate YARN JIRA.
bq. Code related to initStateStore should ideally go into serviceInit(),
primarily to future-proof against us supporting (re)starting stopped services.
serviceStart() indirectly calls initStateStore to open the database, and
serviceStop() closes the database. If we want later to support restarting
stopped services then we need to continue to open the database in
serviceStart(). Moving the initStateStore call to init rather than start means
we will try to use a closed database after the service is restarted, or am I
missing something?
bq. Use the constant JOB here?
Fixed. I had to change its visibility to public in order to access it.
bq. ShuffleHandler#recordJobShuffleInfo: addJobToken() should come after
attempt to include in the store?
Fixed.
bq. ShuffleHandler#close() should probably take care of clearing the static
maps. Alternately, we could just make those maps non-static.
I made userRsrc and secretManager regular members rather than static.
Originally I wasn't sure why they were static and didn't want to mess with that
too much as part of this JIRA, but it's problematic for testing this.
Apparently they always were static, but I couldn't find any reason for them to
be so. Even in light of multiple ShuffleHandler instances, I don't see why
it's something we need (or necessarily even want) to share. Manually ran the
unit tests under hadoop-mapreduce-client-jobclient to verify there wasn't
something fishy going on with the multi-instance ShuffleHandler in mini
clusters or something like that.
bq. ShuffleHander#forgetJob() - should we make those two maps non-static?
Removed forgetJob() now that the members are not static.
bq. Do we need to change hadoop-mapreduce-project/pom.xml, given we already add
the dependencies in the shuffle module?
Yes, if I remove the dependency from that pom.xml then the leveldb jar doesn't
show up in the resulting dist tree under mapreduce/lib/. Maybe may need the
equivalent of YARN-888 for hadoop-mapreduce-project poms to only declare them
in the leaf modules and still have them be picked up properly.
> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Jason Lowe
> Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch,
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running
> map-tasks. On NM restart, the map outputs are cleaned up requiring
> re-execution of map tasks and should be avoided.
--
This message was sent by Atlassian JIRA
(v6.2#6252)