Dirk Daems created HDFS-17673:
---------------------------------

             Summary: Unable to run HDFS balancer as a service
                 Key: HDFS-17673
                 URL: https://issues.apache.org/jira/browse/HDFS-17673
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer & mover
    Affects Versions: 3.4.0
            Reporter: Dirk Daems


When running the HDFS balancer as a service using the following command

 
{code:java}
hdfs balancer -asService {code}
the first balancing round succeeds:

 

 
{code:java}
$ hdfs balancer -asService
2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
running service
2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
hadoop-metrics2.properties
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
period at 10 second(s).
2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
started
2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 14:58:15,702 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 14:58:15,703 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.15:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.16:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.14:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.13:10004
2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
/LeibnizRack17/192.168.146.248:10004
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B    
            0 B                  0  hdfs://prd
The cluster is balanced. Exiting...
2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round{code}
but subsequent rounds always fail with:

 

 
{code:java}
2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
login using keytab.
2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
renewal enabled : false
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://prd will be rate-limited to 20 per second
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
range: [0, 2147483647)
2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 2hrs, 
30mins, 0sec
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2024-11-25 15:03:15,934 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2024-11-25 15:03:15,935 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block keys. 
BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
(default=134217728)
2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
balance work. Already tried 1 times
org.apache.hadoop.metrics2.MetricsException: Metrics source 
Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
    at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
    at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
    at 
org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
    at 
org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
for 5.0 minutes for next round {code}
and finally the process stops after retrying 5 times:

 

 
{code:java}
// code placeholder
{code}
Looking at the HDFS balancer code, I don't see a way to prevent this error, 
which is strange. A metrics source is created, with name "Balancer-" + 
blockpoolID, where blockpoolID is 'BP-1957577775-192.168.146.236-1720512277934' 
in our case.

 

When in servicemode, new balancers (and thus metrics sources) will be created 
in the doBalance method, while the service is running. 

When looking at the metrics implementation, duplicates are only allowed when 
the metrics system is being started. After correct initialization of the 
metrics system, duplicates are no longer allowed (monitoring flag is set to 
true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Reply via email to