[ https://issues.apache.org/jira/browse/MAPREDUCE-5652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated MAPREDUCE-5652:
----------------------------------
Attachment: MAPREDUCE-5652-v9-and-YARN-1987.patch
Filed YARN-1987 to cover the DBIterator wrapper and updated the patch to use
that new wrapper class. Note that the patch includes the YARN-1987 changes so
Jenkins can comment.
bq. If ShuffleHandler gets DBException during recoverState as part of
serviceStart, should ShuffleHandler ignore the exception and continue like the
store doesn't exist?
Failure to recover should be a rare situation: either the DB is corrupted or
inaccessible, or there is a schema incompatibility between versions because an
upgrade occurred during the NM downtime. It should be investigated and
corrected. Otherwise the errors will likely be glossed over, and we will
continue to fail to shuffle across NM restarts from that point forward even
though the user has asked for recovery.
We could add a config to request a "best effort" mode where it will continue
despite the inability to recover, but is that an NM-wide config, a config just
for the shuffle handler, or something else? If we want a config to control
this, I propose we address it in a follow-up JIRA.
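To make the two options concrete, here is a minimal Java sketch of failing
startup versus a "best effort" mode. It is not taken from the patch: the
class RecoveryPolicySketch, the StateStore interface, and the bestEffort flag
are all hypothetical names, and a checked IOException stands in for the
DBException mentioned in the question above.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these names come from the patch. */
class RecoveryPolicySketch {

    /** Stand-in for whatever reads shuffle state back from the store. */
    interface StateStore {
        void recover() throws IOException;
    }

    /**
     * Runs recovery at service start.
     *
     * @param bestEffort hypothetical flag; false gives the fail-fast
     *                   behaviour argued for above
     * @return true if state was recovered, false if recovery failed and
     *         the failure was tolerated because bestEffort was set
     */
    static boolean start(StateStore store, boolean bestEffort) throws IOException {
        try {
            store.recover();
            return true;
        } catch (IOException e) {
            if (!bestEffort) {
                // Fail fast so a corrupted or incompatible DB gets investigated.
                throw new IOException("shuffle state recovery failed", e);
            }
            // Best effort: warn and continue as if no state had been stored.
            System.err.println("WARN: continuing without recovered state: "
                + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        StateStore healthy = () -> { };
        StateStore corrupt = () -> { throw new IOException("corrupt DB"); };
        try {
            System.out.println("healthy, fail-fast: " + start(healthy, false));
            System.out.println("corrupt, best effort: " + start(corrupt, true));
            start(corrupt, false);
        } catch (IOException e) {
            System.out.println("startup aborted: " + e.getMessage());
        }
    }
}
```

In the sketch the flag is a plain method argument, which sidesteps the open
question of whether such a setting would be NM-wide or specific to the
shuffle handler.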
> NM Recovery. ShuffleHandler should handle NM restarts
> -----------------------------------------------------
>
> Key: MAPREDUCE-5652
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5652
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Jason Lowe
> Labels: shuffle
> Attachments: MAPREDUCE-5652-v2.patch, MAPREDUCE-5652-v3.patch,
> MAPREDUCE-5652-v4.patch, MAPREDUCE-5652-v5.patch, MAPREDUCE-5652-v6.patch,
> MAPREDUCE-5652-v7.patch, MAPREDUCE-5652-v8.patch,
> MAPREDUCE-5652-v9-and-YARN-1987.patch, MAPREDUCE-5652.patch
>
>
> ShuffleHandler should work across NM restarts and not require re-running
> map tasks. Currently the map outputs are cleaned up on NM restart, which
> forces re-execution of map tasks; this should be avoided.