[
https://issues.apache.org/jira/browse/HDFS-14736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liying updated HDFS-14736:
--------------------------
Description:
If a subdirectory in the datanode data directory is corrupted for some reason
(for example, after a sudden power failure in the computer room), the datanode
fails to restart. The error information in the datanode log is as follows:
{panel:title=datanode log:}
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data06/block/current...
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data07/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data08/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data09/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data10/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data11/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data12/block/current...
2019-08-09 10:01:06,707 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data05/block/current. Will throw later.
*java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:138)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:837)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:406)
2019-08-09 10:01:15,330 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data06/block/current: 8627ms
2019-08-09 10:01:15,348 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data11/block/current: 8645ms
2019-08-09 10:01:15,352 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data01/block/current: 8649ms
2019-08-09 10:01:15,361 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data12/block/current: 8658ms
2019-08-09 10:01:15,362 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data03/block/current: 8659ms
{panel}
I checked the code of the whole startup process, and the ordering in
{{DataNode}} and {{FsVolumeImpl}} looks suspicious:
{code:java}
void initBlockPool(BPOfferService bpos) throws IOException {
  NamespaceInfo nsInfo = bpos.getNamespaceInfo();
  if (nsInfo == null) {
    throw new IOException("NamespaceInfo not found: Block pool " + bpos
        + " should have retrieved namespace info before initBlockPool.");
  }

  setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());

  // Register the new block pool with the BP manager.
  blockPoolManager.addBlockPool(bpos);

  // In the case that this is the first block pool to connect, initialize
  // the dataset, block scanners, etc.
  initStorage(nsInfo);

  // Exclude failed disks before initializing the block pools to avoid startup
  // failures.
  checkDiskError();

  data.addBlockPool(nsInfo.getBlockPoolID(), conf);
  blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
  initDirectoryScanner(conf);
}
{code}
{code:java}
void checkDirs() throws DiskErrorException {
  // TODO:FEDERATION valid synchronization
  for (BlockPoolSlice s : bpSlices.values()) {
    s.checkDirs();
  }
}
{code}
During a datanode restart, BPServiceActor invokes initBlockPool to initialize
the storage for the block pool. checkDiskError() (which ends up in
FsVolumeImpl#checkDirs) is executed before addBlockPool. But I found that
bpSlices is empty when checkDirs runs, which is very strange, so I checked the
following code:
{code:java}
void addBlockPool(String bpid, Configuration conf) throws IOException {
  File bpdir = new File(currentDir, bpid);
  BlockPoolSlice bp = new BlockPoolSlice(bpid, this, bpdir, conf);
  bpSlices.put(bpid, bp);
}
{code}
As you can see, addBlockPool is executed after checkDirs, so bpSlices is still
empty when checkDirs runs, the check is effectively a no-op, and the corrupted
volume is never detected or excluded. As a result, BlockPoolSlice later throws
*java.io.IOException: Mkdirs failed to create
/data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*, and the
datanode fails to start.
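The no-op check can be reproduced with a small self-contained model (class and
field names below are mine, not Hadoop's; only checkDirs/addBlockPool mirror
the real methods):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal model of the ordering problem: the early checkDirs() iterates
 * bpSlices, but bpSlices is only populated by addBlockPool(), which runs
 * later, so the early check inspects nothing.
 */
public class CheckOrderDemo {
    static final Map<String, String> bpSlices = new HashMap<>();
    static int slicesChecked = 0;

    /** Mirrors FsVolumeImpl#checkDirs(): only inspects registered slices. */
    static void checkDirs() {
        for (String slice : bpSlices.values()) {
            slicesChecked++; // never reached while the map is still empty
        }
    }

    /** Mirrors FsVolumeImpl#addBlockPool(): registers the slice. */
    static void addBlockPool(String bpid) {
        bpSlices.put(bpid, bpid + "/current");
    }

    public static void main(String[] args) {
        checkDirs();                  // startup order: check runs first...
        System.out.println("slices checked before addBlockPool: " + slicesChecked);
        addBlockPool("BP-example-1"); // ...the pool is registered afterwards
        checkDirs();
        System.out.println("slices checked after addBlockPool: " + slicesChecked);
    }
}
```

Running it shows the first checkDirs() touches zero slices, which matches what
I observed in the debugger.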
For example, the corrupted tmp dir looks like this in {{ls}}:
*ls: cannot access tmp: Input/output error*
*total 0*
*d????????? ? ? ? ? ? tmp*
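The IOException itself comes from File#mkdirs returning false when the
directory tree cannot be created. The sketch below (the bpid and paths are
made up for illustration) reproduces that failure by pointing mkdirs at an
unusable parent, a regular file standing in for the corrupted volume
directory:

```java
import java.io.File;
import java.io.IOException;

/**
 * Demonstrates the failure mode behind "Mkdirs failed to create .../tmp":
 * File#mkdirs() returns false when the subtree cannot be created, and the
 * BlockPoolSlice constructor turns that false into an IOException.
 */
public class MkdirsDemo {

    /** Returns true iff the block pool tmp dir could be created under volume. */
    static boolean tryCreateTmp(File volume) {
        File tmp = new File(volume, "BP-example-1/tmp"); // hypothetical bpid
        return tmp.mkdirs();
    }

    public static void main(String[] args) throws IOException {
        // A regular file standing in for a corrupted/unusable volume dir:
        File badVolume = File.createTempFile("volume", null);
        if (!tryCreateTmp(badVolume)) {
            // Same condition that BlockPoolSlice reports as an IOException.
            System.out.println("Mkdirs failed to create tmp under " + badVolume);
        }
        badVolume.delete();
    }
}
```

In the real incident the parent was a directory with unreadable metadata
(d????????? above) rather than a plain file, but the mkdirs result is the
same: false, with no exception explaining why.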
> Datanode fails to start because of a corrupted sub dir in the data directory
> -----------------------------------------------------------------------------
>
> Key: HDFS-14736
> URL: https://issues.apache.org/jira/browse/HDFS-14736
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.7.2
> Reporter: liying
> Assignee: liying
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]