[
https://issues.apache.org/jira/browse/HDFS-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liying updated HDFS-14736:
--------------------------
Description:
If a subdirectory in the datanode data directory is corrupted for some reason
(for example, after a sudden power failure in the computer room), the datanode
fails to restart. The error information in the datanode log is as follows:
{panel:title=datanode log:}
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data06/block/current...
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data07/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data08/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data09/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data10/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data11/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data12/block/current...
2019-08-09 10:01:06,707 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data05/block/current. Will throw later.
*java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:138)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:837)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:406)
2019-08-09 10:01:15,330 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data06/block/current: 8627ms
2019-08-09 10:01:15,348 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data11/block/current: 8645ms
2019-08-09 10:01:15,352 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data01/block/current: 8649ms
2019-08-09 10:01:15,361 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data12/block/current: 8658ms
2019-08-09 10:01:15,362 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data03/block/current: 8659ms
{panel}
I checked the code of the whole startup process, and the ordering in
{{DataNode}} and {{FsVolumeImpl}} looks suspicious:
{code:java}
void initBlockPool(BPOfferService bpos) throws IOException {
  NamespaceInfo nsInfo = bpos.getNamespaceInfo();
  if (nsInfo == null) {
    throw new IOException("NamespaceInfo not found: Block pool " + bpos
        + " should have retrieved namespace info before initBlockPool.");
  }

  setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

  // Register the new block pool with the BP manager.
  blockPoolManager.addBlockPool(bpos);

  // In the case that this is the first block pool to connect, initialize
  // the dataset, block scanners, etc.
  initStorage(nsInfo);

  // Exclude failed disks before initializing the block pools to avoid startup
  // failures.
  checkDiskError();

  data.addBlockPool(nsInfo.getBlockPoolID(), conf);
  blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
  initDirectoryScanner(conf);
}
{code}
{code:java}
void checkDirs() throws DiskErrorException {
  // TODO:FEDERATION valid synchronization
  for (BlockPoolSlice s : bpSlices.values()) {
    s.checkDirs();
  }
}
{code}
During a datanode restart, BPServiceActor invokes initBlockPool to initialize
the storage for the block pool. checkDiskError() (which ends up in
FsVolumeImpl#checkDirs) is executed before addBlockPool. But I found that
bpSlices is empty when checkDirs runs, which is very strange, so I checked the
following code:
{code:java}
void addBlockPool(String bpid, Configuration conf) throws IOException {
  File bpdir = new File(currentDir, bpid);
  BlockPoolSlice bp = new BlockPoolSlice(bpid, this, bpdir, conf);
  bpSlices.put(bpid, bp);
}
{code}
As you can see, addBlockPool is executed after checkDirs, so bpSlices is still
empty when checkDirs runs, the check is effectively a no-op, and the corrupted
volume is never detected or excluded. As a result, BlockPoolSlice later throws
*java.io.IOException: Mkdirs failed to create
/data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*, and the
datanode fails to start.
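The no-op check can be reproduced with a small self-contained model (class and
field names below are mine, not Hadoop's; only checkDirs/addBlockPool mirror
the real methods):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal model of the ordering problem: the early checkDirs() iterates
 * bpSlices, but bpSlices is only populated by addBlockPool(), which runs
 * later, so the early check inspects nothing.
 */
public class CheckOrderDemo {
    static final Map<String, String> bpSlices = new HashMap<>();
    static int slicesChecked = 0;

    /** Mirrors FsVolumeImpl#checkDirs(): only inspects registered slices. */
    static void checkDirs() {
        for (String slice : bpSlices.values()) {
            slicesChecked++; // never reached while the map is still empty
        }
    }

    /** Mirrors FsVolumeImpl#addBlockPool(): registers the slice. */
    static void addBlockPool(String bpid) {
        bpSlices.put(bpid, bpid + "/current");
    }

    public static void main(String[] args) {
        checkDirs();                  // startup order: check runs first...
        System.out.println("slices checked before addBlockPool: " + slicesChecked);
        addBlockPool("BP-example-1"); // ...the pool is registered afterwards
        checkDirs();
        System.out.println("slices checked after addBlockPool: " + slicesChecked);
    }
}
```

Running it shows the first checkDirs() touches zero slices, which matches what
I observed in the debugger.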
For example, the corrupted tmp dir looks like this in {{ls}}:
*ls: cannot access tmp: Input/output error*
*total 0*
*d????????? ? ? ? ? ? tmp*
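The IOException itself comes from File#mkdirs returning false when the
directory tree cannot be created. The sketch below (the bpid and paths are
made up for illustration) reproduces that failure by pointing mkdirs at an
unusable parent, a regular file standing in for the corrupted volume
directory:

```java
import java.io.File;
import java.io.IOException;

/**
 * Demonstrates the failure mode behind "Mkdirs failed to create .../tmp":
 * File#mkdirs() returns false when the subtree cannot be created, and the
 * BlockPoolSlice constructor turns that false into an IOException.
 */
public class MkdirsDemo {

    /** Returns true iff the block pool tmp dir could be created under volume. */
    static boolean tryCreateTmp(File volume) {
        File tmp = new File(volume, "BP-example-1/tmp"); // hypothetical bpid
        return tmp.mkdirs();
    }

    public static void main(String[] args) throws IOException {
        // A regular file standing in for a corrupted/unusable volume dir:
        File badVolume = File.createTempFile("volume", null);
        if (!tryCreateTmp(badVolume)) {
            // Same condition that BlockPoolSlice reports as an IOException.
            System.out.println("Mkdirs failed to create tmp under " + badVolume);
        }
        badVolume.delete();
    }
}
```

In the real incident the parent was a directory with unreadable metadata
(d????????? above) rather than a plain file, but the mkdirs result is the
same: false, with no exception explaining why.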
> Datanode fails to start because of a corrupted sub dir in the data directory
> -----------------------------------------------------------------------------
>
> Key: HDFS-14736
> URL: https://issues.apache.org/jira/browse/HDFS-14736
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.2
> Reporter: liying
> Assignee: liying
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]