liying created HDFS-14736:
-----------------------------

             Summary: Starting the datanode unsuccessfully because of the corrupted sub dir in the data directory
                 Key: HDFS-14736
                 URL: https://issues.apache.org/jira/browse/HDFS-14736
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
    Affects Versions: 2.7.2
            Reporter: liying
            Assignee: liying
If a subdirectory in the datanode data directory is corrupted for some reason (for example, after a sudden power failure in the computer room), the datanode fails to restart. The error information in the datanode log is as follows:

2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data06/block/current...
2019-08-09 10:01:06,703 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data07/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data08/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data09/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data10/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data11/block/current...
2019-08-09 10:01:06,704 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-518068284-10.252.12.3-1523416911512 on volume /data12/block/current...
2019-08-09 10:01:06,707 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Caught exception while scanning /data05/block/current. Will throw later.
*java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.<init>(BlockPoolSlice.java:138)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.addBlockPool(FsVolumeImpl.java:837)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2.run(FsVolumeList.java:406)
2019-08-09 10:01:15,330 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data06/block/current: 8627ms
2019-08-09 10:01:15,348 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data11/block/current: 8645ms
2019-08-09 10:01:15,352 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data01/block/current: 8649ms
2019-08-09 10:01:15,361 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data12/block/current: 8658ms
2019-08-09 10:01:15,362 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-518068284-10.252.12.3-1523416911512 on /data03/block/current: 8659ms

I checked the code of the whole restart path and found something weird in DataNode#initBlockPool and FsVolumeImpl#checkDirs, as follows:

{code:java}
void initBlockPool(BPOfferService bpos) throws IOException {
  NamespaceInfo nsInfo = bpos.getNamespaceInfo();
  if (nsInfo == null) {
    throw new IOException("NamespaceInfo not found: Block pool " + bpos
        + " should have retrieved namespace info before initBlockPool.");
  }
  setClusterId(nsInfo.clusterID, nsInfo.getBlockPoolID());
  // Register the new block pool with the BP manager.
  blockPoolManager.addBlockPool(bpos);

  // In the case that this is the first block pool to connect, initialize
  // the dataset, block scanners, etc.
  initStorage(nsInfo);

  // Exclude failed disks before initializing the block pools to avoid startup
  // failures.
  checkDiskError();

  data.addBlockPool(nsInfo.getBlockPoolID(), conf);
  blockScanner.enableBlockPoolId(bpos.getBlockPoolId());
  initDirectoryScanner(conf);
}
{code}

{code:java}
void checkDirs() throws DiskErrorException {
  // TODO:FEDERATION valid synchronization
  for (BlockPoolSlice s : bpSlices.values()) {
    s.checkDirs();
  }
}
{code}

While the datanode restarts, BPServiceActor invokes initBlockPool to initialize the data storage for the block pool. It executes checkDirs (through checkDiskError) before addBlockPool, but I found that bpSlices is still empty when checkDirs runs, which is very weird. Then I checked the following code:

{code:java}
void addBlockPool(String bpid, Configuration conf) throws IOException {
  File bpdir = new File(currentDir, bpid);
  BlockPoolSlice bp = new BlockPoolSlice(bpid, this, bpdir, conf);
  bpSlices.put(bpid, bp);
}
{code}

As you can see, addBlockPool is executed after checkDirs, so bpSlices is empty at that point and the corrupted sub-directory cannot be excluded by the check. Later, the BlockPoolSlice constructor throws *java.io.IOException: Mkdirs failed to create /data05/block/current/BP-518068284-10.252.12.3-1523416911512/tmp*, and the datanode fails to restart. In this case the tmp directory was corrupted, as ls shows:

*ls: cannot access tmp: Input/output error*
*total 0*
*d????????? ? ? ? ? ? tmp*
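
To make the analysis above concrete, here is a minimal standalone sketch (plain JDK only; VolumeSketch, the "BP-sketch" id, and the /tmp paths are illustrative stand-ins, not FsVolumeImpl or real cluster paths): checkDirs() only validates entries already registered in bpSlices, so running it before addBlockPool() validates nothing, and the corrupted tmp sub-directory is only touched later, where mkdirs() fails and the IOException escapes.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VolumeSketch {
  private final Map<String, File> bpSlices = new ConcurrentHashMap<>();

  // Stand-in for FsVolumeImpl#checkDirs: only block pools already registered
  // in bpSlices are validated, which is nothing at this point of startup.
  void checkDirs() throws IOException {
    for (File dir : bpSlices.values()) {
      if (!dir.isDirectory() || !dir.canRead() || !dir.canWrite()) {
        throw new IOException("Directory is not usable: " + dir);
      }
    }
  }

  // Stand-in for FsVolumeImpl#addBlockPool: this is where the corrupted
  // "tmp" sub-directory is finally touched and the IOException escapes.
  void addBlockPool(String bpid, File currentDir) throws IOException {
    File bpDir = new File(currentDir, bpid);
    File tmpDir = new File(bpDir, "tmp");
    if (!tmpDir.mkdirs() && !tmpDir.isDirectory()) {
      throw new IOException("Mkdirs failed to create " + tmpDir);
    }
    bpSlices.put(bpid, bpDir);
  }

  public static void main(String[] args) throws IOException {
    VolumeSketch volume = new VolumeSketch();
    // Mirrors the ordering in initBlockPool: the check runs first, over an
    // empty map, so it cannot exclude the bad volume.
    volume.checkDirs();
    // On a corrupted volume the failure only shows up here, and with the
    // current code it aborts the whole startup.
    volume.addBlockPool("BP-sketch", new File("/tmp/volume-sketch/current"));
  }
}
{code}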
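
One possible direction, sketched here purely as an illustration (VolumeListSketch, the loop, and the failed-volume bookkeeping are hypothetical, not the actual Hadoop patch): catch the per-volume IOException from addBlockPool and exclude just that volume, instead of letting one corrupted sub-directory abort the whole datanode startup. A real change would also have to respect dfs.datanode.failed.volumes.tolerated.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class VolumeListSketch {
  // Reuses the VolumeSketch class from the previous sketch.
  private final List<VolumeSketch> volumes = new ArrayList<>();
  private final List<VolumeSketch> failedVolumes = new ArrayList<>();

  void addBlockPool(String bpid, File currentDir) {
    // Iterate over a copy so a failed volume can be removed while looping.
    for (VolumeSketch volume : new ArrayList<>(volumes)) {
      try {
        volume.addBlockPool(bpid, currentDir);
      } catch (IOException e) {
        // Instead of re-throwing and failing the whole startup, record the
        // volume as failed and keep going with the healthy volumes.
        System.err.println("Excluding volume for block pool " + bpid + ": " + e.getMessage());
        volumes.remove(volume);
        failedVolumes.add(volume);
      }
    }
  }
}
{code}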