[ 
https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dirk Daems updated HDFS-17673:
------------------------------
    Description: 
When running the HDFS balancer as a service using the following command

 
{code:java}
hdfs balancer -asService {code}
the first balancing round succeeds:
{code:java}
$ hdfs balancer -asService
2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
running service
2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
hadoop-metrics2.properties
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
period at 10 second(s).
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
started
2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 14:58:15,702 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.15:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.16:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.14:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.13:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.248:10004
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B    
            0 B                  0  hdfs://prd
The cluster is balanced. Exiting...
2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round{code}
but subsequent rounds always fail with:
{code:java}
2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 15:03:15,934 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
balance work. Already tried 1 times
org.apache.hadoop.metrics2.MetricsException: Metrics source 
Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at 
org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round {code}
and finally the process stops after retrying 5 times:
{code:java}
// code placeholder
{code}
Looking at the HDFS balancer code, I don't see a way to prevent this error, 
which is strange. A metrics source is created, with name "Balancer-" + 
blockpoolID, where blockpoolID is 'BP-1957577775-192.168.146.236-1720512277934' 
in our case.

When in servicemode, new balancers (and thus metrics sources) will be created 
in the doBalance method, while the service is running. 

When looking at the metrics implementation, duplicates are only allowed when 
the metrics system is being started. After correct initialization of the 
metrics system, duplicates are no longer allowed (monitoring flag is set to 
true).

  was:
When running the HDFS balancer as a service using the following command

 
{code:java}
hdfs balancer -asService {code}
the first balancing round succeeds:

 

 
{code:java}
$ hdfs balancer -asService
2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
running service
2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
hadoop-metrics2.properties
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
period at 10 second(s).
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
started
2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 14:58:15,702 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.15:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.16:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.14:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.13:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.248:10004
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B    
            0 B                  0  hdfs://prd
The cluster is balanced. Exiting...
2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round{code}
but subsequent rounds always fail with:

 

 
{code:java}
2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 15:03:15,934 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
balance work. Already tried 1 times
org.apache.hadoop.metrics2.MetricsException: Metrics source 
Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at 
org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round {code}
and finally the process stops after retrying 5 times:

 

 
{code:java}
// code placeholder
{code}
Looking at the HDFS balancer code, I don't see a way to prevent this error, 
which is strange. A metrics source is created, with name "Balancer-" + 
blockpoolID, where blockpoolID is 'BP-1957577775-192.168.146.236-1720512277934' 
in our case.

 

When in servicemode, new balancers (and thus metrics sources) will be created 
in the doBalance method, while the service is running. 

When looking at the metrics implementation, duplicates are only allowed when 
the metrics system is being started. After correct initialization of the 
metrics system, duplicates are no longer allowed (monitoring flag is set to 
true).


> Unable to run HDFS balancer as a service
> ----------------------------------------
>
>                 Key: HDFS-17673
>                 URL: https://issues.apache.org/jira/browse/HDFS-17673
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer &amp; mover
>    Affects Versions: 3.4.0
>            Reporter: Dirk Daems
>            Priority: Major
>
> When running the HDFS balancer as a service using the following command
>  
> {code:java}
> hdfs balancer -asService {code}
> the first balancing round succeeds:
> {code:java}
> $ hdfs balancer -asService
> 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
> running service
> 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
> hadoop-metrics2.properties
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric 
> snapshot period at 10 second(s).
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
> started
> 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 14:58:15,702 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.15:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.16:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.14:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.13:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.248:10004
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
> Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B  
>               0 B                  0  hdfs://prd
> The cluster is balanced. Exiting...
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round{code}
> but subsequent rounds always fail with:
> {code:java}
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
> balance work. Already tried 1 times
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at 
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>     at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round {code}
> and finally the process stops after retrying 5 times:
> {code:java}
> // code placeholder
> {code}
> Looking at the HDFS balancer code, I don't see a way to prevent this error, 
> which is strange. A metrics source is created, with name "Balancer-" + 
> blockpoolID, where blockpoolID is 
> 'BP-1957577775-192.168.146.236-1720512277934' in our case.
> When in servicemode, new balancers (and thus metrics sources) will be created 
> in the doBalance method, while the service is running. 
> When looking at the metrics implementation, duplicates are only allowed when 
> the metrics system is being started. After correct initialization of the 
> metrics system, duplicates are no longer allowed (monitoring flag is set to 
> true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to