[
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144374#comment-13144374
]
Hitesh Shah commented on MAPREDUCE-3121:
----------------------------------------
I did a brief review of the patch. Here are a couple of points I saw; I will
go through the patch in more detail later.
1. I believe the return in ContainersLaunch#handle when no good disks are
available needs to trigger a ContainerExitedWithFailure event so that the
container state machine can handle the failure and clean up appropriately. Once
that is done, I think the cleanup case needs a null check on rContainerDatum.
Please add appropriate test cases to verify that the container state machine
works correctly if all disks have failed.
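To make point 1 concrete, here is a rough sketch of the behaviour I have in
mind. Everything in it is a hypothetical stand-in and not the NodeManager code
or the patch: SketchContainer, SketchLauncher and the State enum are made up
for illustration, and the launchData map only plays the role of
rContainerDatum. It assumes the failure is reported by moving the container
into a failed state and that cleanup may run for a container that never
launched.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoGoodDisksSketch {

    // Hypothetical states; the real container state machine has many more.
    enum State { NEW, RUNNING, EXITED_WITH_FAILURE, DONE }

    // Hypothetical stand-in for a container and its state.
    static class SketchContainer {
        final String id;
        State state = State.NEW;

        SketchContainer(String id) {
            this.id = id;
        }
    }

    // Hypothetical stand-in for the launcher service.
    static class SketchLauncher {
        // Plays the role of the per-container launch data (rContainerDatum above).
        final Map<String, Object> launchData = new HashMap<>();

        void handleLaunch(SketchContainer c, List<String> goodDirs) {
            if (goodDirs.isEmpty()) {
                // Do not return silently: report the failure so the
                // state machine can react and clean up.
                c.state = State.EXITED_WITH_FAILURE;
                return;
            }
            launchData.put(c.id, new Object());
            c.state = State.RUNNING;
        }

        // Returns true if there was launch data to clean up.
        boolean handleCleanup(SketchContainer c) {
            Object datum = launchData.remove(c.id);
            c.state = State.DONE;
            // Null check: a container that failed before launch never
            // registered any launch data.
            if (datum == null) {
                return false;
            }
            // A real implementation would stop the process and release
            // its resources here.
            return true;
        }
    }
}
```

A test for the all-disks-failed case would then launch with an empty list of
good dirs, assert the failed state, and assert that cleanup completes without
a NullPointerException.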
2. A general comment about passing around the diskchecker object and calling
the getLocalDirs()/getLogDirs() functions. Given that there is a timer task
involved that will be updating the values in the background, there may be
various inconsistencies which crop up depending on when the functions are
called.
For example, in the Linux Executor, the final command being constructed uses 2
calls:
- checker.getLocalDirsString()
- checker.getLocalPaths()
Modifications made during the time that elapses between the 2 calls may create
issues. Also, the executor now needs to check whether there is at least one
good disk available.
IMO, one approach we could take is to do the check at the top level. For
example, ContainersLaunch, as per the patch, does a sanity check on available
disks and bails if there is an issue. It could create a snapshot of the
available disks at this point and pass it to the ContainerLaunch, which in turn
would pass it on to the executor's launch container call. This would probably
also help simplify the code, in that the low-level components would no longer
need to worry about checking whether any good disks are available.
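The snapshot idea could look roughly like the sketch below. All names in it
(DirsSnapshot, SketchChecker, launchContainer, buildCommand) are hypothetical
and not from the patch, and plain strings stand in for the real path types.
It assumes the checker can copy both lists under a single lock, so that the
caller sees one consistent view even if the timer task updates the checker
afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DiskSnapshotSketch {

    // Immutable view of the good dirs at one point in time (hypothetical type).
    static final class DirsSnapshot {
        final List<String> localDirs;
        final List<String> logDirs;

        DirsSnapshot(List<String> localDirs, List<String> logDirs) {
            this.localDirs = Collections.unmodifiableList(new ArrayList<>(localDirs));
            this.logDirs = Collections.unmodifiableList(new ArrayList<>(logDirs));
        }

        boolean hasGoodDisks() {
            return !localDirs.isEmpty() && !logDirs.isEmpty();
        }
    }

    // Hypothetical stand-in for the disk checker that a timer task updates.
    static class SketchChecker {
        private List<String> localDirs;
        private List<String> logDirs;

        SketchChecker(List<String> localDirs, List<String> logDirs) {
            update(localDirs, logDirs);
        }

        // Called by the background timer task in this sketch.
        synchronized void update(List<String> newLocalDirs, List<String> newLogDirs) {
            this.localDirs = new ArrayList<>(newLocalDirs);
            this.logDirs = new ArrayList<>(newLogDirs);
        }

        // Both lists are copied under one lock, so they are mutually consistent.
        synchronized DirsSnapshot snapshot() {
            return new DirsSnapshot(localDirs, logDirs);
        }
    }

    // Top-level check: take one snapshot, bail if it is empty, pass it down.
    static String launchContainer(SketchChecker checker) {
        DirsSnapshot snapshot = checker.snapshot();
        if (!snapshot.hasGoodDisks()) {
            throw new IllegalStateException("no good disks available");
        }
        return buildCommand(snapshot);
    }

    // Low-level code only reads the snapshot; it never calls the checker.
    static String buildCommand(DirsSnapshot snapshot) {
        return "local=" + String.join(",", snapshot.localDirs)
                + " log=" + String.join(",", snapshot.logDirs);
    }
}
```

With this shape, a command built from a snapshot cannot mix values from
before and after a background update, and the good-disk check lives in one
place instead of in every low-level component.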
> NodeManager should handle disk-failures
> ---------------------------------------
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Ravi Gummadi
> Fix For: 0.23.1
>
> Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to
> minimize the impact of transient/permanent disk failures on containers. With
> larger number of disks per node, the ability to continue to run containers on
> other disks is crucial.