[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

Zhihai Xu (Jira) Tue, 27 Jul 2021 19:45:04 -0700


     [ https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Zhihai Xu updated HDFS-16111:
-----------------------------
    Description: 
When we upgraded our Hadoop cluster from Hadoop 2.6.0 to Hadoop 3.2.2, volumes 
failed on many datanodes, which caused some missing blocks at the time. 
Although we later recovered all the missing blocks by symlinking the path 
(dfs/dn/current) on the failed volume to a new directory and copying all the 
data to the new directory, we missed our SLA, and the upgrade of our 
production cluster was delayed by several hours.

When this issue happened, we saw many of the following exceptions before the 
volume failed on the datanode:

 [DataXceiver for client  at /XX.XX.XX.XX:XXX 
[Receiving block BP-XXXXXX-XX.XX.XX.XX-XXXXXX:blk_XXXXX_XXXXXXX]] 
datanode.DataNode (BlockReceiver.java:<init>(289)) - IOException in 
BlockReceiver constructor :Possible disk error: Failed to create 
/XXXXXXX/dfs/dn/current/BP-XXXXXX-XX.XX.XX.XX-XXXXXXXXX/tmp/blk_XXXXXX. Cause is
 java.io.IOException: No space left on device
         at java.io.UnixFileSystem.createFileExclusively(Native Method)
         at java.io.File.createNewFile(File.java:1012)
         at org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
         at java.lang.Thread.run(Thread.java:748)

 

We found that this issue had two causes:

First, the upgrade process used some extra disk space on each disk volume of 
the datanode:

BlockPoolSliceStorage.doUpgrade 
(https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445)
is the main upgrade function in the datanode, and it consumes some extra 
storage. The extra storage consists of the new directories created in 
/current/<bpid>/current; the block data files and block metadata files 
themselves are hard-linked with /current/<bpid>/previous after the upgrade. 
Since many new directories are created, this uses some disk space on each 
disk volume.

 

Second, there is a potential bug when picking a disk volume to write a new 
block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
The code that selects a disk checks whether the available space on the 
selected disk is more than the size in bytes of the block file to store 
(https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86).
But creating a new block creates two files: the block file blk_XXXX and the 
block metadata file blk_XXXX_XXXX.meta. The code that finalizes a block 
updates both the block file size and the metadata file size: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391
When choosing a disk, the current code in RoundRobinVolumeChoosingPolicy 
considers only the size of the block file and not the size of the block 
metadata file. Many blocks can be received at the same time; the default 
maximum number of DataXceiver threads is 4096. As a result, the total size 
needed to write a block is underestimated, which can cause the disk-full 
error above (No space left on device).
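
To give a rough sense of how much space the block-size-only check can 
overlook, here is a small back-of-the-envelope calculation. The constants are 
assumptions based on common HDFS defaults (128 MiB block size, one 4-byte 
CRC32C checksum per 512 data bytes, a 7-byte metadata header); they are not 
taken from the cluster in this report, and the class and method names are 
illustrative only. The 4096 figure is the DataXceiver limit mentioned above, 
used here as a datanode-wide upper bound rather than a per-volume number.

```java
/** Rough estimate of the block metadata (.meta) bytes that a block-size-only check ignores. */
final class MetaSizeEstimate {
    // Assumed defaults, not values taken from the report:
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // block size in bytes
    static final int BYTES_PER_CHECKSUM = 512;          // data bytes covered by one checksum
    static final int CHECKSUM_SIZE = 4;                 // bytes per CRC32C checksum
    static final int META_HEADER_SIZE = 7;              // version (2) + type (1) + bytesPerChecksum (4)

    /** Size of the .meta file for a block holding blockBytes bytes of data. */
    static long metaFileSize(long blockBytes) {
        // One checksum per started 512-byte chunk (round up).
        long chunks = (blockBytes + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        return META_HEADER_SIZE + chunks * CHECKSUM_SIZE;
    }

    public static void main(String[] args) {
        long perBlock = metaFileSize(BLOCK_SIZE);
        long concurrentWriters = 4096; // DataXceiver limit quoted above, as an upper bound
        System.out.println("meta bytes per full block: " + perBlock);
        System.out.println("meta bytes for " + concurrentWriters
                + " concurrent full blocks: " + perBlock * concurrentWriters);
    }
}
```

Under these assumptions each full block carries about 1 MiB of metadata, so 
thousands of concurrent writers could need several GiB that the volume check 
never accounted for.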

 

Since the size of the block metadata file is not fixed, I suggest adding a 
configuration 
(dfs.datanode.round-robin-volume-choosing-policy.additional-available-space) 
to safeguard the disk space when choosing a volume to write new block data in 
RoundRobinVolumeChoosingPolicy.
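
The idea can be sketched as a standalone round-robin chooser that requires 
some headroom on top of the block size. This is a minimal illustration, not 
the actual Hadoop patch: the class RoundRobinSketch, the Volume interface and 
the unchecked exception are hypothetical stand-ins, and only the notion of an 
"additional available space" value comes from the proposal above.

```java
import java.util.List;

/** Standalone sketch of round-robin volume selection with a reserved headroom. */
final class RoundRobinSketch {
    /** Hypothetical stand-in for a datanode volume. */
    interface Volume {
        long getAvailable(); // free bytes on the volume
    }

    private final long additionalAvailableSpace; // headroom in bytes; 0 disables it
    private int next = 0;                        // index where the next search starts

    RoundRobinSketch(long additionalAvailableSpace) {
        this.additionalAvailableSpace = additionalAvailableSpace;
    }

    /** Returns the next volume that can hold blockSize bytes plus the headroom. */
    synchronized <V extends Volume> V choose(List<V> volumes, long blockSize) {
        if (volumes.isEmpty()) {
            throw new IllegalStateException("No volumes available");
        }
        int start = next % volumes.size(); // guard against a shrinking list
        int i = start;
        long maxAvailable = 0;
        do {
            V volume = volumes.get(i);
            i = (i + 1) % volumes.size();
            long available = volume.getAvailable();
            // The headroom covers space the block size alone does not, such as
            // the metadata file and writes in flight on other threads.
            if (available - additionalAvailableSpace > blockSize) {
                next = i;
                return volume;
            }
            maxAvailable = Math.max(maxAvailable, available);
        } while (i != start);
        // Simplification: an unchecked exception keeps the sketch short.
        throw new IllegalStateException("Out of space: largest available volume has "
                + maxAvailable + " bytes, block needs " + blockSize
                + " plus headroom " + additionalAvailableSpace);
    }
}
```

With a headroom of 0 the check reduces to comparing available space against 
the block size alone, which is the behaviour described above.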

 

  was:
(Identical to the description above, followed by this sentence:)

The default value can be 0 for backward compatibility.


> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16111
>                 URL: https://issues.apache.org/jira/browse/HDFS-16111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Zhihai Xu
>            Assignee: Zhihai Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h