[
https://issues.apache.org/jira/browse/MAPREDUCE-3121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151940#comment-13151940
]
Vinod Kumar Vavilapalli commented on MAPREDUCE-3121:
----------------------------------------------------
Okay, still a monster patch, but in far better shape. I anticipate just one
more iteration.
- Please don't include util-like APIs in the NodeHealthStatus record; we want
to keep the record implementations to the bare essentials.
- Pass DiskHandlerService everywhere instead of NodeHealthCheckerService.
- Change DISKS_FAILED to not use 144. Maybe -1001 (see the sketch after this
list).
- Remove unused imports in ContainersLauncher.
- Remove the commented out init() code in ContainerExecutor.
- Rename LocalStorage to DirectoryCollection?
- getHealthScriptTimer() belongs to the HealthScriptRunner itself. Let's make
nodeHealthScriptRunner.getTimerTask() public and drop TimerTask
getHealthScriptTimer() from NodeHealthCheckerService.
- Trivial: (java)doc NodeHealthCheckerService class.
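On the DISKS_FAILED point above: a negative sentinel can never collide with a
real process exit status (those are confined to 0-255, and 144 is a value a
container could genuinely return). A minimal sketch; where the constant
finally lives is up to the patch:
{code}
// Sketch only; the enclosing class is an assumption.
// Negative => can never be confused with an actual process exit status.
public static final int DISKS_FAILED = -1001;
{code}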
ContainerLaunch:
- When all disks have failed, use the health-report in the exception *and*
also add a diagnostic to the event (a sketch follows these items).
- Same in ResourceLocalizationService.
- DiskHandlerService: when major-percentage disks are gone, log the report
(+108).
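A rough sketch of what I mean for ContainerLaunch; the DiskHandlerService
accessors and the surrounding names are assumptions, not necessarily what the
patch exposes:
{code}
// Inside ContainerLaunch, before launching; names are illustrative.
if (diskHandlerService.getGoodLocalDirs().isEmpty()) {
  String report = diskHandlerService.getHealthReport();
  // Put the report on the event so it reaches the container's diagnostics...
  dispatcher.getEventHandler().handle(
      new ContainerDiagnosticsUpdateEvent(containerID, report));
  // ...and into the exception, so callers that only see the throwable get it.
  throw new IOException("All local disks are bad: " + report);
}
{code}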
ResourceLocalizationService:
- Take a snapshot of dirs before the health-check for startLocalizer()?
- PublicLocalizer uses a LocalDirAllocator for downloading files. Should it
instead use DiskHandlerService? Maybe also check for min-percentage disks to
be alive for each addResource() request (sketch below). You will need changes
to FSDownload too.
- Remove PUBCACHE_CTXT after doing the above.
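To illustrate the addResource() check; areDisksHealthy() is an assumed
DiskHandlerService method standing in for the min-percentage test:
{code}
// PublicLocalizer.addResource(), roughly.
public void addResource(LocalizerResourceRequestEvent request) {
  if (!diskHandlerService.areDisksHealthy()) {
    // Fail fast instead of handing FSDownload a doomed download.
    LOG.error("Rejecting public resource: too few good local-dirs. "
        + diskHandlerService.getHealthReport());
    return;
  }
  // ... pick the target dir via DiskHandlerService instead of the
  // PUBCACHE_CTXT LocalDirAllocator, then submit the FSDownload ...
}
{code}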
AppLogAggregatorImpl:
- Existing log message at +120 can also list the good dirs. Bad dirs can be
deduced from the DHS logs.
DiskHandlerService:
- The APIs with size are either not needed or don't need the size parameter
itself.
- Take a lock on the cloned config wherever it is accessed:
updateDirsInConfiguration(), getLocalPathForWrite(String pathStr), etc.
(sketch below).
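For the locking, synchronizing all reads and writes on the cloned conf object
itself should do; a sketch with assumed field names:
{code}
// 'conf' is DiskHandlerService's cloned Configuration; goodLocalDirs and
// goodLogDirs are assumed fields. Updates and reads must share one monitor,
// since LocalDirAllocator re-reads the dir lists from the conf each time.
void updateDirsInConfiguration() {
  synchronized (conf) {
    conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
        goodLocalDirs.toArray(new String[0]));
    conf.setStrings(YarnConfiguration.NM_LOG_DIRS,
        goodLogDirs.toArray(new String[0]));
  }
}

public Path getLocalPathForWrite(String pathStr) throws IOException {
  synchronized (conf) {
    return localDirAllocator.getLocalPathForWrite(pathStr, conf);
  }
}
{code}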
MiniYARNCluster:
- Change the defaults of numLocalDirs and numLogDirs to 4? Also, consolidate
the constructors (sketch below)? I can see the N-constructors pattern of
MiniMRCluster creeping in; let's avoid that.
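On the constructors: one fully-parameterized constructor plus a single
convenience overload avoids the telescoping pile-up. A sketch:
{code}
// All real setup lives in one constructor; the overload just supplies
// the proposed defaults.
public MiniYARNCluster(String testName) {
  this(testName, 1, 4, 4); // 1 NM, 4 local-dirs, 4 log-dirs
}

public MiniYARNCluster(String testName, int numNodeManagers,
    int numLocalDirs, int numLogDirs) {
  // ... actual setup here only ...
}
{code}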
conf/controller.cfg
- Update to not have the removed configs.
- Can you also add banned.users and min.user.id with their default values
(example below)?
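Something along these lines; the values shown are only the usual defaults,
use whatever we document elsewhere:
{code}
# banned.users: users who are never allowed to run containers
banned.users=hdfs,yarn,mapred,bin
# min.user.id: reject containers from system accounts below this uid
min.user.id=1000
{code}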
TestDiskFailure:
- verifyDisksHealth(): Loop through and wait for a maximum of, say, 10 seconds
for the node to turn unhealthy.
- waitForDiskHealthCheck(): We can capture DiskHandlerService's last report
time and wait till it changes at least once. Of course, that should be capped
by an upper limit on the wait time (sketch below).
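For both waits, a capped poll is enough; getLastDisksCheckTime() is an assumed
accessor on DiskHandlerService:
{code}
// Wait until DiskHandlerService publishes at least one fresh report,
// but never longer than the cap.
private void waitForDiskHealthCheck(DiskHandlerService dhs)
    throws InterruptedException {
  final long lastReport = dhs.getLastDisksCheckTime();
  final long deadline = System.currentTimeMillis() + 10 * 1000; // 10s cap
  while (dhs.getLastDisksCheckTime() == lastReport
      && System.currentTimeMillis() < deadline) {
    Thread.sleep(100);
  }
}
{code}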
Can you run the linux-container-executor tests: TestLinuxContainerExecutor and
TestContainerManagerWithLCE?
Create a separate ticket for handling the disks that come back up online.
Create a separate ticket for having a metric for numFailedDirs.
-----
Test plan:
- RM stops scheduling when major-percentage of disks go bad: Done
- Node's DiskHandler recognises bad disks: Done
- Node's DiskHandler recognises minimum percentage of good disks: Done
- Integration test: Run a mapreduce job (so that Shuffle is also verified),
offline some disks, run one more job, and verify that both apps pass. TODO
- LogAggregation test: Verify that logs written on bad disks are ignored for
aggregation (augment TestLogAggregationService). TODO
- ContainerLaunch: Verify that
-- new containers don't use bad directories (by testing the LOCAL_DIRS env in
a custom map job; see the sketch after this plan): TODO
-- if major-percentage disks turn bad,
--- the container exits with the proper exit code (should be easy with a
custom application): TODO
--- localization for a resource fails: TODO
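For the LOCAL_DIRS check, the custom map job can be as small as this; the
bad-dirs list coming in through the job conf, and its key, are assumptions:
{code}
// Mapper body for the LOCAL_DIRS verification: fail the task if the
// container was handed any known-bad directory.
protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  Set<String> badDirs = new HashSet<String>(Arrays.asList(
      context.getConfiguration().getStrings("test.bad.dirs"))); // assumed key
  for (String dir : System.getenv("LOCAL_DIRS").split(",")) {
    if (badDirs.contains(dir)) {
      throw new IOException("Container was given a failed local-dir: " + dir);
    }
  }
}
{code}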
> NodeManager should handle disk-failures
> ---------------------------------------
>
> Key: MAPREDUCE-3121
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-3121
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2, nodemanager
> Affects Versions: 0.23.0
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Ravi Gummadi
> Priority: Blocker
> Fix For: 0.23.1
>
> Attachments: 3121.patch, 3121.v1.1.patch, 3121.v1.patch, 3121.v2.patch
>
>
> This is akin to MAPREDUCE-2413 but for YARN's NodeManager. We want to
> minimize the impact of transient/permanent disk failures on containers. With
> larger number of disks per node, the ability to continue to run containers on
> other disks is crucial.