[ 
https://issues.apache.org/jira/browse/HDFS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035990#comment-13035990
 ] 

Bharath Mundlapudi commented on HDFS-1592:
------------------------------------------

First, thank you for identifying this issue, Eli. Great job!

A couple of comments:
1. We did test a couple of things, like masking permissions down to the dfs 
level. That didn't catch this issue. Your pointer about setting permissions on 
a specific directory helped us reproduce this case. Thanks again.
2. We also tested by unmounting disks.
3. Then we tested by injecting failures at the kernel level.

Regarding test cases:
I agree with you that we need more tests, but I think we should do that in 
another Jira, since we have already spent a lot of effort on manual testing. 
Can we file another Jira to track this?

With this new patch, I have tested the following new cases. Can you please 
review and provide your feedback?

case 1: All four volumes good, Vol Tolerated=1, expected outcome = BPService 
should start

11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - 
/grid/0/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - 
/grid/1/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - 
/grid/2/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume - 
/grid/3/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: Registered FSDatasetState MBean
11/05/19 04:57:51 INFO datanode.DataNode: Adding block pool 
BP-1694914230-10.72.86.55-1305704227822
11/05/19 04:57:51 INFO datanode.DirectoryScanner: Periodic Directory Tree 
Verification scan starting at 1305782678947 with interval 21600000
11/05/19 04:57:51 INFO datanode.DataNode: in register: 
sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 04:57:51 INFO datanode.DataNode: bpReg after 
=lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
11/05/19 04:57:51 INFO datanode.DataNode: in 
register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 04:57:51 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 
using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; 
heartBeatInterval=3000
11/05/19 04:57:51 INFO datanode.DataNode: BlockReport of 0 blocks got processed 
in 3 msecs
11/05/19 04:57:51 INFO datanode.DataNode: sent block report, processed 
command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@3e5a91
11/05/19 04:57:51 INFO datanode.BlockPoolSliceScanner: Periodic Block 
Verification scan initialized with interval 1814400000.
11/05/19 04:57:51 INFO datanode.DataBlockScanner: Added 
bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
11/05/19 04:57:56 INFO datanode.BlockPoolSliceScanner: Starting a new period : 
work left in prev period : 0.00%

case 2: One failed volume (/grid/2), three good volumes, Vol Tolerated=1, 
expected outcome = BPService should start

11/05/19 05:01:27 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:01:27 INFO common.Storage: Formatting ...
11/05/19 05:01:27 WARN common.Storage: Invalid directory in: 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
 File 
file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:01:27 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO datanode.DataNode: setting up storage: 
nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - 
/grid/0/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - 
/grid/1/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume - 
/grid/3/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: Registered FSDatasetState MBean
11/05/19 05:01:27 INFO datanode.DataNode: Adding block pool 
BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:01:27 INFO datanode.DirectoryScanner: Periodic Directory Tree 
Verification scan starting at 1305789604425 with interval 21600000
11/05/19 05:01:27 INFO datanode.DataNode: in register: 
sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 05:01:27 INFO datanode.DataNode: bpReg after 
=lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
11/05/19 05:01:27 INFO datanode.DataNode: in 
register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 05:01:27 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020 
using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; 
heartBeatInterval=3000
11/05/19 05:01:27 INFO datanode.DataNode: BlockReport of 0 blocks got processed 
in 4 msecs
11/05/19 05:01:27 INFO datanode.DataNode: sent block report, processed 
command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@1adb7b8
11/05/19 05:01:27 INFO datanode.BlockPoolSliceScanner: Periodic Block 
Verification scan initialized with interval 1814400000.
11/05/19 05:01:27 INFO datanode.DataBlockScanner: Added 
bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
11/05/19 05:01:32 INFO datanode.BlockPoolSliceScanner: Starting a new period : 
work left in prev period : 0.00%

case 3: Two failed volumes (/grid/1, /grid/2), two good volumes, Vol 
Tolerated=1, expected outcome = BPService should NOT start

11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/1/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:04:06 INFO common.Storage: Formatting ...
11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:04:06 INFO common.Storage: Formatting ...
11/05/19 05:04:06 WARN common.Storage: Invalid directory in: 
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
 File 
file:/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 WARN common.Storage: Invalid directory in: 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
 File 
file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 INFO common.Storage: Locking is disabled
11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
 does not exist.
11/05/19 05:04:06 INFO common.Storage: Locking is disabled
11/05/19 05:04:06 INFO datanode.DataNode: setting up storage: 
nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:04:06 FATAL datanode.DataNode: 
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, 
storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, 
ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) initialization 
failed for block pool BP-1694914230-10.72.86.55-1305704227822
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value for 
volumes required - validVolsRequired: 3, Current valid volumes: 2, 
volsConfigured: 4, volFailuresTolerated: 1
        at 
org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1160)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.initFsDataSet(DataNode.java:1420)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.access$1100(DataNode.java:169)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:804)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
        at java.lang.Thread.run(Thread.java:619)
11/05/19 05:04:06 WARN datanode.DataNode: 
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, 
storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075, 
ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) ending block 
pool service for: BP-1694914230-10.72.86.55-1305704227822

case 4: All failed volumes, Vol Tolerated=1, expected outcome = BPService 
should NOT start

11/05/19 05:07:51 INFO common.Storage: Storage directory 
/grid/0/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory 
/grid/1/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory 
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory 
/grid/3/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 FATAL datanode.DataNode: 
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, 
infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) initialization 
failed for block pool BP-1694914230-10.72.86.55-1305704227822
java.io.IOException: All specified directories are not accessible or do not 
exist.
        at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:182)
        at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:217)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:797)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
        at java.lang.Thread.run(Thread.java:619)
11/05/19 05:07:51 WARN datanode.DataNode: 
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=, 
infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) ending block 
pool service for: BP-1694914230-10.72.86.55-1305704227822
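
The expected outcomes in the four cases above follow from the arithmetic shown 
in the case 3 exception message (validVolsRequired: 3, volsConfigured: 4, 
volFailuresTolerated: 1). A minimal sketch of that check is below. It is an 
illustration only, not the code in the patch: the class name 
VolumeToleranceSketch and the method canStart are hypothetical, and the 
formula validVolsRequired = volsConfigured - volFailuresTolerated is inferred 
from the logged values.

```java
// Hypothetical illustration; not taken from the HDFS-1592 patch.
class VolumeToleranceSketch {

    /**
     * Returns true when enough volumes are valid for the block pool
     * service to start. The required count is assumed to be
     * volsConfigured - volFailuresTolerated, which matches the values
     * printed in the case 3 log (4 - 1 = 3).
     */
    static boolean canStart(int volsConfigured,
                            int volFailuresTolerated,
                            int validVolumes) {
        int validVolsRequired = volsConfigured - volFailuresTolerated;
        return validVolumes >= validVolsRequired;
    }

    public static void main(String[] args) {
        int volsConfigured = 4;
        int volFailuresTolerated = 1;
        // Valid volume counts for cases 1 through 4 above.
        int[] validPerCase = {4, 3, 2, 0};
        for (int i = 0; i < validPerCase.length; i++) {
            boolean start = canStart(volsConfigured, volFailuresTolerated,
                                     validPerCase[i]);
            System.out.println("case " + (i + 1)
                + ": valid volumes=" + validPerCase[i]
                + ", BPService should start=" + start);
        }
    }
}
```

Note that in the actual case 4 run the failure is raised earlier, by 
DataStorage.recoverTransitionRead, before the volume count check is reached. 
The sketch only captures the expected start/no-start outcome.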


> Datanode startup doesn't honor volumes.tolerated 
> -------------------------------------------------
>
>                 Key: HDFS-1592
>                 URL: https://issues.apache.org/jira/browse/HDFS-1592
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 0.20.204.0
>            Reporter: Bharath Mundlapudi
>            Assignee: Bharath Mundlapudi
>             Fix For: 0.20.204.0, 0.23.0
>
>         Attachments: HDFS-1592-1.patch, HDFS-1592-2.patch, HDFS-1592-3.patch, 
> HDFS-1592-rel20.patch
>
>
> Datanode startup doesn't honor volumes.tolerated for hadoop 20 version.
