[ https://issues.apache.org/jira/browse/HDFS-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335185#comment-14335185 ]
Chris Nauroth commented on HDFS-7830:
-------------------------------------

Hi [~eddyxu]. Another potential problem I've noticed in the DataNode reconfiguration code is that it never recalculates {{FsDatasetImpl#validVolsRequired}}. This is a {{final}} variable calculated as (# volumes configured) - (# volume failures tolerated):

{code}
this.validVolsRequired = volsConfigured - volFailuresTolerated;
{code}

If this variable is not updated during a DataNode reconfiguration, it can lead to unexpected situations. For example:
# The DataNode starts with 6 volumes (all healthy) and {{dfs.datanode.failed.volumes.tolerated}} set to 2.
# {{FsDatasetImpl#validVolsRequired}} is set to 6 - 2 = 4.
# The DataNode is reconfigured to run with 8 volumes (all still healthy).
# 3 volumes then fail. The admin would expect the DataNode to abort, but 8 - 3 = 5 good volumes remain, and {{FsDatasetImpl#validVolsRequired}} is still 4, so {{FsDatasetImpl#hasEnoughResource}} returns {{true}}.

Does it make sense for you to address this as part of the patch you're working on now, or would you prefer that I file a separate jira to track it? Thanks!

> DataNode does not release the volume lock when adding a volume fails.
> ---------------------------------------------------------------------
>
>                 Key: HDFS-7830
>                 URL: https://issues.apache.org/jira/browse/HDFS-7830
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Lei (Eddy) Xu
>
> When a failure occurs in the add-volume process, the {{in_use.lock}} is not
> released. Doing another {{-reconfig}} to remove the new dir in order to
> clean up does not remove the lock either; lsof still shows the datanode
> holding the lock file.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
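The stale-value scenario in the comment above can be sketched in a few lines. This is a hypothetical model, not the real {{FsDatasetImpl}} code: the class name {{StaleValidVolsSketch}} and its fields are invented for illustration, and only the arithmetic mirrors the steps described (startup with 6 volumes tolerating 2 failures, reconfigure to 8, then lose 3).

```java
// Hypothetical sketch (not the actual FsDatasetImpl) showing how a
// validVolsRequired value frozen at startup gives the wrong answer
// after a reconfiguration adds volumes.
public class StaleValidVolsSketch {

    // Computed once at startup and never recalculated:
    // volsConfigured (6) - volFailuresTolerated (2) = 4.
    static final int validVolsRequired = 6 - 2;

    // Mirrors the shape of the hasEnoughResource check described above:
    // do enough healthy volumes remain?
    static boolean hasEnoughResource(int healthyVolumes) {
        return healthyVolumes >= validVolsRequired;
    }

    public static void main(String[] args) {
        // After reconfiguring to 8 volumes and then losing 3,
        // 8 - 3 = 5 healthy volumes remain.
        int healthyAfterFailures = 8 - 3;

        // With the stale value of 4 the check passes, even though the
        // admin expects the DataNode to abort.
        System.out.println(hasEnoughResource(healthyAfterFailures)); // prints "true"

        // If validVolsRequired were recalculated after reconfiguration
        // (8 configured - 2 tolerated = 6), the check would fail as expected.
        int recalculated = 8 - 2;
        System.out.println(healthyAfterFailures >= recalculated); // prints "false"
    }
}
```

Recomputing the threshold whenever the volume list changes, rather than caching it as {{final}}, is the fix the comment is asking about.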