[
https://issues.apache.org/jira/browse/HDFS-7830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335185#comment-14335185
]
Chris Nauroth commented on HDFS-7830:
-------------------------------------
Hi [~eddyxu]. Another potential problem that I've noticed in the DataNode
reconfiguration code is that it never recalculates
{{FsDatasetImpl#validVolsRequired}}. This is a {{final}} variable calculated
as (# volumes configured) - (# volume failures tolerated):
{code}
this.validVolsRequired = volsConfigured - volFailuresTolerated;
{code}
If this variable is never recalculated during a DataNode reconfiguration, the
volume-failure threshold becomes stale, which can lead to unexpected behavior.
For example:
# DataNode starts running with 6 volumes (all healthy) and
{{dfs.datanode.failed.volumes.tolerated}} set to 2.
# {{FsDatasetImpl#validVolsRequired}} is set to 6 - 2 = 4.
# DataNode is reconfigured to run with 8 volumes (all still healthy).
# Now 3 volumes fail. The admin would expect the DataNode to abort, since only
2 failures are tolerated and the recalculated threshold would be 8 - 2 = 6. But
there are 8 - 3 = 5 good volumes left, and {{FsDatasetImpl#validVolsRequired}}
is still 4, so {{FsDatasetImpl#hasEnoughResource}} returns {{true}}.
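To make the scenario concrete, here is a minimal standalone sketch of the
problem. The class and method names below only mirror the ones discussed; this
is not the actual {{FsDatasetImpl}} code, just a toy reproduction of a
{{final}} threshold going stale:

```java
// Toy reproduction (hypothetical class, not Hadoop code) of a final
// volume-failure threshold that is computed once and never recalculated.
class VolumeChecker {
    // Computed once at construction, mirroring validVolsRequired.
    private final int validVolsRequired;

    VolumeChecker(int volsConfigured, int volFailuresTolerated) {
        this.validVolsRequired = volsConfigured - volFailuresTolerated;
    }

    boolean hasEnoughResource(int healthyVolumes) {
        return healthyVolumes >= validVolsRequired;
    }
}

public class StaleThresholdDemo {
    public static void main(String[] args) {
        // Startup: 6 volumes, 2 failures tolerated -> threshold = 4.
        VolumeChecker stale = new VolumeChecker(6, 2);

        // Reconfigure to 8 volumes, then 3 of them fail: 5 healthy remain.
        int healthyAfterFailures = 8 - 3;

        // Stale threshold (4) still passes, even though 3 > 2 failures.
        System.out.println(stale.hasEnoughResource(healthyAfterFailures)); // prints "true"

        // A threshold recalculated at reconfiguration (8 - 2 = 6) would
        // correctly report insufficient resources.
        VolumeChecker recalculated = new VolumeChecker(8, 2);
        System.out.println(recalculated.hasEnoughResource(healthyAfterFailures)); // prints "false"
    }
}
```

A fix along these lines would recompute the threshold whenever the volume
configuration changes, rather than fixing it at construction time.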
Is this something that makes sense for you to address as part of the patch
you're working on now, or would you prefer I file a separate jira to track
this? Thanks!
> DataNode does not release the volume lock when adding a volume fails.
> ---------------------------------------------------------------------
>
> Key: HDFS-7830
> URL: https://issues.apache.org/jira/browse/HDFS-7830
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.6.0
> Reporter: Lei (Eddy) Xu
> Assignee: Lei (Eddy) Xu
>
> When there is a failure in adding volume process, the {{in_use.lock}} is not
> released. Also, doing another {{-reconfig}} to remove the new dir in order to
> cleanup doesn't remove the lock. lsof still shows datanode holding on to the
> lock file.