[
https://issues.apache.org/jira/browse/HDFS-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035990#comment-13035990
]
Bharath Mundlapudi commented on HDFS-1592:
------------------------------------------
First, thank you for identifying this issue, Eli. Great job!
A couple of comments:
1. We did test a couple of things, like masking permissions up to the dfs
level. That didn't catch this issue. Your pointer about setting permissions on
a specific directory helped us reproduce this case. Thanks again.
2. We also tested by unmounting disks.
3. Then we tested by injecting failures at the kernel level.
Regarding test cases:
I agree with you that we need more tests, but I think we should do that in
another JIRA, since we have already spent a lot of effort on manual testing.
Can we file another JIRA to track this?
With this new patch, I have tested the following new cases. Can you please
review and provide your feedback?
case 1: All four good volumes, Vol Tolerated=1, expected outcome = BPService
should start
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume -
/grid/0/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume -
/grid/1/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume -
/grid/2/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: FSDataset added volume -
/grid/3/testing/hadoop-logs/dfs/data/current
11/05/19 04:57:51 INFO datanode.DataNode: Registered FSDatasetState MBean
11/05/19 04:57:51 INFO datanode.DataNode: Adding block pool
BP-1694914230-10.72.86.55-1305704227822
11/05/19 04:57:51 INFO datanode.DirectoryScanner: Periodic Directory Tree
Verification scan starting at 1305782678947 with interval 21600000
11/05/19 04:57:51 INFO datanode.DataNode: in register:
sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 04:57:51 INFO datanode.DataNode: bpReg after
=lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
11/05/19 04:57:51 INFO datanode.DataNode: in
register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 04:57:51 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020
using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec;
heartBeatInterval=3000
11/05/19 04:57:51 INFO datanode.DataNode: BlockReport of 0 blocks got processed
in 3 msecs
11/05/19 04:57:51 INFO datanode.DataNode: sent block report, processed
command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@3e5a91
11/05/19 04:57:51 INFO datanode.BlockPoolSliceScanner: Periodic Block
Verification scan initialized with interval 1814400000.
11/05/19 04:57:51 INFO datanode.DataBlockScanner: Added
bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
11/05/19 04:57:56 INFO datanode.BlockPoolSliceScanner: Starting a new period :
work left in prev period : 0.00%
case 2: One failed volume (/grid/2), three good volumes, Vol Tolerated=1,
expected outcome = BPService should start
11/05/19 05:01:27 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:01:27 INFO common.Storage: Formatting ...
11/05/19 05:01:27 WARN common.Storage: Invalid directory in:
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
File
file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:01:27 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:01:27 INFO common.Storage: Locking is disabled
11/05/19 05:01:27 INFO datanode.DataNode: setting up storage:
nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume -
/grid/0/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume -
/grid/1/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: FSDataset added volume -
/grid/3/testing/hadoop-logs/dfs/data/current
11/05/19 05:01:27 INFO datanode.DataNode: Registered FSDatasetState MBean
11/05/19 05:01:27 INFO datanode.DataNode: Adding block pool
BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:01:27 INFO datanode.DirectoryScanner: Periodic Directory Tree
Verification scan starting at 1305789604425 with interval 21600000
11/05/19 05:01:27 INFO datanode.DataNode: in register:
sid=DS-340618566-10.72.86.55-50010-1305704313207;SI=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 05:01:27 INFO datanode.DataNode: bpReg after
=lv=-35;cid=test;nsid=413952175;c=0;sid=DS-340618566-10.72.86.55-50010-1305704313207;name=127.0.0.1:50010
11/05/19 05:01:27 INFO datanode.DataNode: in
register:;bpDNR=lv=-35;cid=test;nsid=413952175;c=0
11/05/19 05:01:27 INFO datanode.DataNode: For namenode localhost/127.0.0.1:8020
using BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec;
heartBeatInterval=3000
11/05/19 05:01:27 INFO datanode.DataNode: BlockReport of 0 blocks got processed
in 4 msecs
11/05/19 05:01:27 INFO datanode.DataNode: sent block report, processed
command:org.apache.hadoop.hdfs.server.protocol.DatanodeCommand$Finalize@1adb7b8
11/05/19 05:01:27 INFO datanode.BlockPoolSliceScanner: Periodic Block
Verification scan initialized with interval 1814400000.
11/05/19 05:01:27 INFO datanode.DataBlockScanner: Added
bpid=BP-1694914230-10.72.86.55-1305704227822 to blockPoolScannerMap, new size=1
11/05/19 05:01:32 INFO datanode.BlockPoolSliceScanner: Starting a new period :
work left in prev period : 0.00%
case 3: Two failed volumes (/grid/1, /grid/2), two good volumes, Vol Tolerated=1,
expected outcome = BPService should NOT start
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/1/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:04:06 INFO common.Storage: Formatting ...
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:04:06 INFO common.Storage: Formatting ...
11/05/19 05:04:06 WARN common.Storage: Invalid directory in:
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
File
file:/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 WARN common.Storage: Invalid directory in:
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822:
File
file:/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 INFO common.Storage: Locking is disabled
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/1/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data/current/BP-1694914230-10.72.86.55-1305704227822
does not exist.
11/05/19 05:04:06 INFO common.Storage: Locking is disabled
11/05/19 05:04:06 INFO datanode.DataNode: setting up storage:
nsid=0;bpid=BP-1694914230-10.72.86.55-1305704227822;lv=-35;nsInfo=lv=-35;cid=test;nsid=413952175;c=0;bpid=BP-1694914230-10.72.86.55-1305704227822
11/05/19 05:04:06 FATAL datanode.DataNode:
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010,
storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075,
ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) initialization
failed for block pool BP-1694914230-10.72.86.55-1305704227822
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid value for
volumes required - validVolsRequired: 3, Current valid volumes: 2,
volsConfigured: 4, volFailuresTolerated: 1
at
org.apache.hadoop.hdfs.server.datanode.FSDataset.<init>(FSDataset.java:1160)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.initFsDataSet(DataNode.java:1420)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.access$1100(DataNode.java:169)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:804)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
at java.lang.Thread.run(Thread.java:619)
11/05/19 05:04:06 WARN datanode.DataNode:
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010,
storageID=DS-340618566-10.72.86.55-50010-1305704313207, infoPort=50075,
ipcPort=50020, storageInfo=lv=-35;cid=test;nsid=413952175;c=0) ending block
pool service for: BP-1694914230-10.72.86.55-1305704227822
case 4: All failed volumes, Vol Tolerated=1, expected outcome = BPService
should NOT start
11/05/19 05:07:51 INFO common.Storage: Storage directory
/grid/0/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory
/grid/1/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory
/grid/2/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 INFO common.Storage: Storage directory
/grid/3/testing/hadoop-logs/dfs/data is not formatted.
11/05/19 05:07:51 INFO common.Storage: Formatting ...
11/05/19 05:07:51 FATAL datanode.DataNode:
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=,
infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) initialization
failed for block pool BP-1694914230-10.72.86.55-1305704227822
java.io.IOException: All specified directories are not accessible or do not
exist.
at
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:182)
at
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:217)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBPStorage(DataNode.java:797)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.setupBP(DataNode.java:774)
at
org.apache.hadoop.hdfs.server.datanode.DataNode$BPOfferService.run(DataNode.java:1191)
at java.lang.Thread.run(Thread.java:619)
11/05/19 05:07:51 WARN datanode.DataNode:
DatanodeRegistration(hadooplab40.yst.corp.yahoo.com:50010, storageID=,
infoPort=50075, ipcPort=50020, storageInfo=lv=0;cid=;nsid=0;c=0) ending block
pool service for: BP-1694914230-10.72.86.55-1305704227822
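The arithmetic behind the four expected outcomes above can be sketched as
follows. This is only a minimal illustration, not the code in the patch: the
class VolumeToleranceSketch and its method names are hypothetical, and only the
quantity names (validVolsRequired, volsConfigured, volFailuresTolerated) are
taken from the exception message in case 3. It assumes the required count is
volsConfigured minus volFailuresTolerated, which matches 4 - 1 = 3 in that log.
Note that in case 4 the real DataNode fails earlier, in
DataStorage.recoverTransitionRead, so the sketch models only the start/no-start
outcome, not where the failure is raised.

```java
// Hypothetical sketch of the volume-tolerance check; not the actual FSDataset code.
final class VolumeToleranceSketch {

    private VolumeToleranceSketch() {}

    /** Minimum number of usable volumes (assumed: configured minus tolerated failures). */
    static int validVolsRequired(int volsConfigured, int volFailuresTolerated) {
        return volsConfigured - volFailuresTolerated;
    }

    /** True if enough volumes are usable for the block pool service to start. */
    static boolean canStart(int volsConfigured, int volFailuresTolerated,
                            int currentValidVolumes) {
        return currentValidVolumes
                >= validVolsRequired(volsConfigured, volFailuresTolerated);
    }

    public static void main(String[] args) {
        int volsConfigured = 4;
        int volFailuresTolerated = 1;
        // Valid volume counts for cases 1 through 4 above.
        int[] validVolumesPerCase = {4, 3, 2, 0};
        for (int i = 0; i < validVolumesPerCase.length; i++) {
            boolean start = canStart(volsConfigured, volFailuresTolerated,
                                     validVolumesPerCase[i]);
            System.out.println("case " + (i + 1) + ": valid volumes="
                    + validVolumesPerCase[i] + ", should start=" + start);
        }
    }
}
```

With four configured volumes and one tolerated failure, this gives "start" for
four or three valid volumes and "do not start" for two or zero, which is what
the logs for cases 1 through 4 show.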
> Datanode startup doesn't honor volumes.tolerated
> -------------------------------------------------
>
> Key: HDFS-1592
> URL: https://issues.apache.org/jira/browse/HDFS-1592
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.20.204.0
> Reporter: Bharath Mundlapudi
> Assignee: Bharath Mundlapudi
> Fix For: 0.20.204.0, 0.23.0
>
> Attachments: HDFS-1592-1.patch, HDFS-1592-2.patch, HDFS-1592-3.patch,
> HDFS-1592-rel20.patch
>
>
> Datanode startup doesn't honor volumes.tolerated for hadoop 20 version.