[
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhihai Xu updated HDFS-16111:
-----------------------------
Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid
failed volumes. (was: Add a configuration to RoundRobinVolumeChoosingPolicy to
avoid picking an almost full volume to place a replica. )
> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes.
> ------------------------------------------------------------------------------
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Zhihai Xu
> Assignee: Zhihai Xu
> Priority: Major
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got
> failed volumes on a lot of datanodes, which caused some missing blocks at that
> time. Although we later recovered all the missing blocks by symlinking the
> path (dfs/dn/current) on the failed volume to a new directory and copying all
> the data to the new directory, we missed our SLA and it delayed our upgrade
> process on our production cluster by several hours.
> When this issue happened, we saw a lot of these exceptions on the datanode before
> the volume failed:
> [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block
> BP-XXXXXX-XX.XX.XX.XX-XXXXXX:blk_XXXXX_XXXXXXX]]
> datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in
> BlockReceiver constructor :Possible disk error: Failed to create
> /XXXXXXX/dfs/dn/current/BP-XXXXXX-XX.XX.XX.XX-XXXXXXXXX/tmp/blk_XXXXXX. Cause
> is
> java.io.IOException: No space left on device
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:1012)
> at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
> at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
> at java.lang.Thread.run(Thread.java:748)
>
> We found that this issue happened for the following two reasons:
> First, the upgrade process consumes some extra disk space on each disk volume of
> the datanode:
> BlockPoolSliceStorage.doUpgrade
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
> is the main upgrade function in the datanode, and it adds some extra storage. The
> extra storage comes from the new directories created under /current/<bpid>/current,
> even though all block data files and block metadata files are hard-linked from
> /current/<bpid>/previous after the upgrade. Since a lot of new directories are
> created, this uses some disk space on each disk volume.
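> The following is not the actual BlockPoolSliceStorage code, just a minimal sketch of
> the mechanism described above: the upgrade builds a new directory tree and hard-links
> the existing block and meta files into it, so the file contents are not duplicated,
> but every new directory entry still consumes space on the volume. Paths and names
> here are illustrative only.
> {code:java}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class UpgradeLayoutSketch {
>   public static void main(String[] args) throws IOException {
>     // Illustrative paths; the real layout is <volume>/current/<bpid>/{previous,current}.
>     Path previous = Paths.get("volume", "current", "BP-1", "previous", "finalized", "subdir0");
>     Path current  = Paths.get("volume", "current", "BP-1", "current",  "finalized", "subdir0");
>
>     // Creating the new directory tree already uses disk space (directory metadata),
>     // even before any block is written into it.
>     Files.createDirectories(previous);
>     Files.createDirectories(current);
>
>     Path oldBlock = previous.resolve("blk_1001");
>     Files.write(oldBlock, new byte[]{1, 2, 3});
>
>     // Hard link: the block data itself is shared, not copied...
>     Files.createLink(current.resolve("blk_1001"), oldBlock);
>     // ...but the many new directories created across the volume still add up.
>   }
> }
> {code}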
>
> Second, there is a potential bug when picking a disk volume to write a new
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy.
> The code that selects a disk checks whether the available space on the
> selected disk is more than the size in bytes of the block file to store
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86).
> But when a new block is created, two files are written: the block file blk_XXXX
> and the block metadata file blk_XXXX_XXXX.meta. This is the code that finalizes
> a block, where both the block file size and the metadata file size are updated:
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391
> The current code only considers the size of the block file, not the size of the
> block metadata file, when choosing a disk in RoundRobinVolumeChoosingPolicy. In
> addition, many blocks can be received at the same time; the default maximum
> number of DataXceiver threads is 4096. This underestimates the total space needed
> to write a block, which can cause the above disk full error (No space left on
> device).
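> A simplified sketch of the selection loop described above (not the real
> RoundRobinVolumeChoosingPolicy class; the Volume interface here is hypothetical
> and stands in for FsVolumeSpi): the policy walks the volume list round-robin and
> accepts the first volume whose reported available space covers blockSize. Nothing
> in this check accounts for the blk_XXXX_XXXX.meta file or for other blocks being
> written to the same volume concurrently.
> {code:java}
> import java.io.IOException;
> import java.util.List;
>
> interface Volume {
>   long getAvailable() throws IOException;   // stand-in for FsVolumeSpi#getAvailable()
> }
>
> class RoundRobinSketch {
>   private int curVolume = 0;
>
>   Volume chooseVolume(List<Volume> volumes, long blockSize) throws IOException {
>     int startVolume = curVolume;
>     while (true) {
>       Volume volume = volumes.get(curVolume);
>       curVolume = (curVolume + 1) % volumes.size();
>       // Only the block file size is checked; the meta file and other
>       // in-flight writers on the same volume are not considered.
>       if (volume.getAvailable() >= blockSize) {
>         return volume;
>       }
>       if (curVolume == startVolume) {
>         throw new IOException("Out of space: all volumes have less than "
>             + blockSize + " bytes available");
>       }
>     }
>   }
> }
> {code}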
>
> Since the size of the block metadata file is not fixed, I suggest adding a
> configuration
> (dfs.datanode.round-robin-volume-choosing-policy.additional-available-space)
> to safeguard disk space when choosing a volume to write new block data in
> RoundRobinVolumeChoosingPolicy.
> The default value can be 0 for backward compatibility.
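> One possible shape of the proposed change (not the final patch; the class, method,
> and constant names below are illustrative, only the configuration key and its
> default of 0 come from this ticket): read the new key once when the policy is
> configured, then require that slack on top of the block size before accepting a
> volume.
> {code:java}
> import org.apache.hadoop.conf.Configuration;
>
> class AdditionalSpaceSketch {
>   // Key and default proposed in this ticket; the constant names are illustrative.
>   static final String ADDITIONAL_AVAILABLE_SPACE_KEY =
>       "dfs.datanode.round-robin-volume-choosing-policy.additional-available-space";
>   static final long ADDITIONAL_AVAILABLE_SPACE_DEFAULT = 0L;
>
>   private long additionalAvailableSpace;
>
>   void setConf(Configuration conf) {
>     additionalAvailableSpace =
>         conf.getLong(ADDITIONAL_AVAILABLE_SPACE_KEY, ADDITIONAL_AVAILABLE_SPACE_DEFAULT);
>   }
>
>   boolean hasEnoughSpace(long available, long blockSize) {
>     // Require the configured slack on top of the block size, so the meta file
>     // and concurrent writers are less likely to push the volume to 100% full.
>     return available >= blockSize + additionalAvailableSpace;
>   }
> }
> {code}
> With the default of 0 the behaviour is unchanged; clusters that hit this problem
> could set the key in hdfs-site.xml to a value that covers the expected meta-file
> and concurrent-write overhead per volume.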
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]