[ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDFS-16111.
------------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Thanks!

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16111
>                 URL: https://issues.apache.org/jira/browse/HDFS-16111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Zhihai Xu
>            Assignee: Zhihai Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When we upgraded our Hadoop cluster from Hadoop 2.6.0 to Hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on each failed volume to a new directory and copying all 
> the data over, we missed our SLA and the upgrade of our production cluster was 
> delayed by several hours.
> When this issue happened, we saw many exceptions like the following on the 
> datanodes before the volumes failed:
>  [DataXceiver for client at /XX.XX.XX.XX:XXX 
> [Receiving block BP-XXXXXX-XX.XX.XX.XX-XXXXXX:blk_XXXXX_XXXXXXX]] 
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
> BlockReceiver constructor :Possible disk error: Failed to create 
> /XXXXXXX/dfs/dn/current/BP-XXXXXX-XX.XX.XX.XX-XXXXXXXXX/tmp/blk_XXXXXX. Cause 
> is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened for two reasons:
> First, the upgrade process consumes some extra disk space on each disk volume 
> of the datanode. BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
> is the main upgrade function in the datanode, and the extra storage it adds is 
> all new directories created in /current/<bpid>/current, although all block 
> data files and block metadata files are hard-linked with 
> /current/<bpid>/previous after the upgrade. Since a lot of new directories are 
> created, this still uses some disk space on each disk volume, as the small 
> illustration below shows.
>  
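> As a small standalone illustration of the hard-link point (plain java.nio 
> code, not Hadoop code; the paths and sizes are made up), a hard-linked block 
> file shares its data with the original, so linking blocks into the new layout 
> copies no block data, while the newly created directories are what consume 
> the extra space:
> 
>     import java.nio.file.*;
> 
>     public class HardLinkDemo {
>         public static void main(String[] args) throws Exception {
>             Path dir = Files.createTempDirectory("upgrade-demo");
>             Path previous = Files.createDirectories(dir.resolve("previous"));
>             Path current = Files.createDirectories(dir.resolve("current"));
> 
>             // A 1 MiB stand-in for an existing block file in the old layout.
>             Path oldBlock = previous.resolve("blk_1234");
>             Files.write(oldBlock, new byte[1024 * 1024]);
> 
>             // Hard-link it into the new layout: no block data is copied.
>             Path newBlock = Files.createLink(current.resolve("blk_1234"), oldBlock);
> 
>             System.out.println(Files.size(newBlock));                 // same 1 MiB
>             System.out.println(Files.isSameFile(oldBlock, newBlock)); // true
>         }
>     }
> 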
> Second, there is a potential bug in how a disk volume is picked to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
> The code that selects a disk checks whether the available space on the 
> selected disk is more than the size in bytes of the block file to store 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86).
> But when a new block is created, two files are written: the block file 
> blk_XXXX and the block metadata file blk_XXXX_XXXX.meta. This is the code that 
> finalizes a block, where both the block file size and the metadata file size 
> are updated: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391
> The current code only considers the size of the block file and does not 
> consider the size of the block metadata file when choosing a disk in 
> RoundRobinVolumeChoosingPolicy. In addition, many blocks can be received at 
> the same time (the default maximum number of DataXceiver threads is 4096), so 
> the total space needed for the writes in flight is underestimated, which can 
> cause the disk full error above (No space left on device).
>  
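> The following is a minimal standalone sketch of the selection logic described 
> above (not the actual Hadoop source; the Volume interface and class name are 
> made up for illustration). Only the block file size is compared against the 
> volume's free space:
> 
>     import java.util.List;
> 
>     public class RoundRobinSketch {
>         /** Hypothetical stand-in for the volume API; only free space matters. */
>         interface Volume {
>             long getAvailable();
>         }
> 
>         private int curVolume = 0;
> 
>         Volume chooseVolume(List<Volume> volumes, long blockSize) {
>             int startVolume = curVolume;
>             while (true) {
>                 Volume volume = volumes.get(curVolume);
>                 curVolume = (curVolume + 1) % volumes.size();
>                 // Only the block file size is checked; the blk_XXXX_XXXX.meta
>                 // file and other writes already in flight on the same volume
>                 // are not accounted for, so the volume can still fill up.
>                 if (volume.getAvailable() >= blockSize) {
>                     return volume;
>                 }
>                 if (curVolume == startVolume) {
>                     throw new IllegalStateException(
>                         "Out of space: no volume has " + blockSize + " bytes available");
>                 }
>             }
>         }
>     }
> 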
> Since the size of the block metadata file is not fixed, I suggest adding a 
> configuration property, 
> dfs.datanode.round-robin-volume-choosing-policy.additional-available-space, 
> to reserve extra disk space as a safeguard when choosing a volume to write new 
> block data in RoundRobinVolumeChoosingPolicy.
> The default value can be 0 for backward compatibility.
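> 
> A minimal sketch of how the adjusted check could look, assuming the new 
> property has been read into an additionalAvailableSpace value (the helper 
> name and the numbers below are made up for illustration; the committed change 
> may differ):
> 
>     public class SafeguardedCheck {
>         /**
>          * True if the volume has room for the block plus the configured
>          * headroom. With the default of 0 the behaviour is unchanged,
>          * which preserves backward compatibility.
>          */
>         static boolean hasRoom(long availableOnVolume, long blockSize,
>                                long additionalAvailableSpace) {
>             return availableOnVolume >= blockSize + additionalAvailableSpace;
>         }
> 
>         public static void main(String[] args) {
>             long available = 256L * 1024 * 1024; // 256 MiB free on the volume
>             long blockSize = 128L * 1024 * 1024; // 128 MiB block being written
>             long headroom  = 200L * 1024 * 1024; // 200 MiB safeguard (illustrative)
>             // Without the safeguard the volume is accepted; with it, it is skipped.
>             System.out.println(hasRoom(available, blockSize, 0L));       // true
>             System.out.println(hasRoom(available, blockSize, headroom)); // false
>         }
>     }
> 
> An operator could then set the property in hdfs-site.xml to leave enough 
> headroom for the metadata file and other concurrent writers; leaving it at 0 
> keeps the current behaviour.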



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

