[
https://issues.apache.org/jira/browse/HDFS-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
AMC-team updated HDFS-15440:
----------------------------
Description:
In the HDFS disk balancer, the configuration parameter
"dfs.disk.balancer.block.tolerance.percent" sets a percentage (e.g. 10
means 10%) that defines a good-enough move.
The description in hdfs-default.xml does not make clear how the value is
actually calculated and applied:
{quote}When a disk balancer copy operation is proceeding, the datanode is still
active. So it might not be possible to move the exactly specified amount of
data. So tolerance allows us to define a percentage which defines a good enough
move.
{quote}
So I referred to the [official doc of the HDFS disk
balancer|https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html],
whose description is:
{quote}The tolerance percent specifies when we have reached a good enough value
for any copy step. For example, if you specify 10 then getting close to 10% of
the target value is good enough. It is to say if the move operation is 20GB in
size, if we can move 18GB (20 * (1-10%)) that operation is considered
successful.
{quote}
However, the source code in DiskBalancer.java reads:
{code:java}
// Inflates bytesCopied and returns true or false. This allows us to stop
// copying if we have reached close enough.
private boolean isCloseEnough(DiskBalancerWorkItem item) {
  long temp = item.getBytesCopied() +
      ((item.getBytesCopied() * getBlockTolerancePercentage(item)) / 100);
  return (item.getBytesToCopy() >= temp) ? false : true;
}
{code}
Here, if item.getBytesToCopy() = 20 GB, then item.getBytesCopied() = 18 GB is
still not close enough, because 20 > 18 + 18 * 0.1 = 19.8.
Per the documentation, the check should instead be whether 18 >= 20 * (1 - 0.1).
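To make the mismatch concrete, here is a standalone sketch (my own illustration, not the actual Hadoop code or a proposed patch; the class and method names are made up) contrasting the current check with one that matches the documented semantics:
{code:java}
public class ToleranceCheckSketch {

  // Current check (same logic as DiskBalancer.java): inflate bytesCopied by
  // the tolerance and compare against bytesToCopy. With tolerance = 10,
  // copying 18 GB out of 20 GB is NOT close enough: 18 + 1.8 = 19.8 < 20.
  static boolean isCloseEnoughCurrent(long bytesToCopy, long bytesCopied,
      long tolerancePercent) {
    long inflated = bytesCopied + (bytesCopied * tolerancePercent) / 100;
    return bytesToCopy < inflated;
  }

  // Check matching the documented semantics: the move is good enough once
  // bytesCopied reaches bytesToCopy * (1 - tolerance/100). Both sides are
  // scaled by 100 to stay in integer arithmetic. With tolerance = 10,
  // copying 18 GB out of 20 GB IS close enough: 18 * 100 >= 20 * 90.
  static boolean isCloseEnoughDocumented(long bytesToCopy, long bytesCopied,
      long tolerancePercent) {
    return bytesCopied * 100 >= bytesToCopy * (100 - tolerancePercent);
  }

  public static void main(String[] args) {
    final long GB = 1024L * 1024 * 1024;
    System.out.println(isCloseEnoughCurrent(20 * GB, 18 * GB, 10));    // false
    System.out.println(isCloseEnoughDocumented(20 * GB, 18 * GB, 10)); // true
  }
}
{code}
With tolerance = 10 and a 20 GB move, the current check only succeeds once more than roughly 18.18 GB (20 / 1.1) has been copied, whereas the documented rule accepts at exactly 18 GB.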
The calculation in isLessThanNeeded() ("Checks if a given block is less than
needed size to meet our goal.") is unintuitive in the same way.
Also, this parameter has no upper-bound check, which means it can even be set
to 1000000%, an obviously invalid value.
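A range check along the following lines could reject such values when the configuration is read (again just a sketch under my assumptions, not the actual Hadoop code; the method name and message are made up):
{code:java}
// Sketch of a possible range check, not the actual Hadoop code: a percentage
// tolerance is only meaningful in [0, 100], so reject anything outside that
// range when the configuration is read.
static long validateTolerancePercent(long configured) {
  if (configured < 0 || configured > 100) {
    throw new IllegalArgumentException(
        "dfs.disk.balancer.block.tolerance.percent must be in [0, 100], got "
            + configured);
  }
  return configured;
}
{code}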
*How to fix*
Although this may not lead to severe failures, it would be better to make the
documentation and the code consistent, and to refine the description in
hdfs-default.xml so that it is more precise and clear.
> The usage of dfs.disk.balancer.block.tolerance.percent is not intuitive.
> --------------------------------------------------------------------------
>
> Key: HDFS-15440
> URL: https://issues.apache.org/jira/browse/HDFS-15440
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Reporter: AMC-team
> Priority: Major
>