[ 
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144374#comment-13144374
 ] 

Hitesh Shah commented on MAPREDUCE-3121:
----------------------------------------

I did a brief review of the patch. A couple of points I noticed are below; I 
will go through the patch in more detail later. 

1. I believe the early return in ContainersLaunch#handle, taken when no good 
disks are available, needs to trigger a ContainerExitedWithFailure event so 
that the container state machine can handle the failure and clean up 
appropriately. Once that is done, I think the cleanup path needs a null check 
on rContainerDatum. Please add test cases verifying that the container state 
machine works correctly if all disks have failed.

2. A general comment about how the diskchecker object is passed around and 
how the getLocalDirs()/getLogDirs() functions are called. Given that there is 
a timer task that updates these values in the background, various 
inconsistencies may crop up depending on when the functions are called. 

For example, in the Linux Executor, the final command being constructed uses 2 
calls: 
   - checker.getLocalDirsString()
   - checker.getLocalPaths()

The values could be modified in the time that elapses between the 2 calls, 
which may create issues. Also, the executor now needs to check whether at 
least one good disk is available.

IMO, one approach we could take is to do a check at the top level. For example, 
ContainersLaunch as per the patch does a sanity check on available disks and 
bails if there is an issue. It could create a snapshot of available disks at 
this point and pass them to the ContainerLaunch, which in turn will pass them 
on to the executor's launch container call. This will also probably help in 
simplifying the code in that the low level components no longer need to be 
worried about checking if there are any good disks available.
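
To illustrate the snapshot idea, here is a rough sketch. It is not taken from 
the patch: DiskSnapshotSketch, DiskChecker, GoodDirs, launch() and 
buildCommand() are hypothetical names made up for this example, and the 
command string is a placeholder rather than the real executor command line. 
The only point is that both directory lists are copied once, under one lock, 
and the low level code works from that copy instead of the live checker.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; none of these names come from the patch. */
class DiskSnapshotSketch {

  /** Stand-in for the disk checker whose lists a timer task updates. */
  static class DiskChecker {
    private List<String> localDirs;
    private List<String> logDirs;

    DiskChecker(List<String> localDirs, List<String> logDirs) {
      update(localDirs, logDirs);
    }

    /** Called by the (assumed) background timer task after each disk check. */
    final synchronized void update(List<String> local, List<String> log) {
      this.localDirs = new ArrayList<>(local);
      this.logDirs = new ArrayList<>(log);
    }

    /** Copies both lists under one lock so they describe the same moment. */
    synchronized GoodDirs snapshot() {
      return new GoodDirs(localDirs, logDirs);
    }
  }

  /** Immutable view of the good disks at the time of the top-level check. */
  static final class GoodDirs {
    final List<String> localDirs;
    final List<String> logDirs;

    GoodDirs(List<String> localDirs, List<String> logDirs) {
      this.localDirs = Collections.unmodifiableList(new ArrayList<>(localDirs));
      this.logDirs = Collections.unmodifiableList(new ArrayList<>(logDirs));
    }

    boolean isUsable() {
      return !localDirs.isEmpty() && !logDirs.isEmpty();
    }
  }

  /** Top-level check: take one snapshot, bail out early, else pass it down. */
  static Optional<String> launch(DiskChecker checker) {
    GoodDirs dirs = checker.snapshot();
    if (!dirs.isUsable()) {
      // The real code would raise the container failure event here.
      return Optional.empty();
    }
    return Optional.of(buildCommand(dirs));
  }

  /** Low-level code sees only the snapshot and needs no disk checks. */
  static String buildCommand(GoodDirs dirs) {
    return "launch --local-dirs " + String.join(",", dirs.localDirs)
        + " --log-dirs " + String.join(",", dirs.logDirs);
  }
}
```

With something along these lines, a timer update that lands after snapshot() 
returns cannot change what buildCommand() sees, and the "at least one good 
disk" check lives in exactly one place.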

> NodeManager should handle disk-failures
> ---------------------------------------
>
>                 Key: MAPREDUCE-3121
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2, nodemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Ravi Gummadi
>             Fix For: 0.23.1
>
>         Attachments: 3121.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to 
> minimize the impact of transient/permanent disk failures on containers. With 
> larger number of disks per node, the ability to continue to run containers on 
> other disks is crucial.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
