Dirk Daems created HDFS-17673: --------------------------------- Summary: Unable to run HDFS balancer as a service Key: HDFS-17673 URL: https://issues.apache.org/jira/browse/HDFS-17673 Project: Hadoop HDFS Issue Type: Bug Components: balancer & mover Affects Versions: 3.4.0 Reporter: Dirk Daems
When running the HDFS balancer as a service using the following command {code:java} hdfs balancer -asService {code} the first balancing round succeeds: {code:java} $ hdfs balancer -asService 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long running service 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system started 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes = [hdfs://prd] 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false, sort top nodes = false, hot block time interval = 0] 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = [] 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = [] 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = [] 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will login using keytab. 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto renewal enabled : false Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved NameNode 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for hdfs://prd will be rate-limited to 20 per second 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key range: [0, 2147483647) 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec 2024-11-25 14:58:15,702 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 (default=134217728) 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: /LeibnizRack17/192.168.146.15:10004 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: /LeibnizRack17/192.168.146.16:10004 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: /LeibnizRack17/192.168.146.14:10004 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: /LeibnizRack17/192.168.146.13:10004 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: /LeibnizRack17/192.168.146.248:10004 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: [] 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: [] Nov 25, 2024, 2:58:15 PM 0 0 B 0 B 0 B 0 hdfs://prd The cluster is balanced. Exiting... 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed! 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait for 5.0 minutes for next round{code} but subsequent rounds always fail with: {code:java} 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes = [hdfs://prd] 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false, sort top nodes = false, hot block time interval = 0] 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = [] 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = [] 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = [] 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will login using keytab. 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto renewal enabled : false Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved NameNode 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for hdfs://prd will be rate-limited to 20 per second 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key range: [0, 2147483647) 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2024-11-25 15:03:15,935 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2024-11-25 15:03:15,935 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 (default=134217728) 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do balance work. Already tried 1 times org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait for 5.0 minutes for next round {code} and finally the process stops after retrying 5 times: {code:java} // code placeholder {code} Looking at the HDFS balancer code, I don't see a way to prevent this error, which is strange. A metrics source is created, with name "Balancer-" + blockpoolID, where blockpoolID is 'BP-1957577775-192.168.146.236-1720512277934' in our case. When in servicemode, new balancers (and thus metrics sources) will be created in the doBalance method, while the service is running. When looking at the metrics implementation, duplicates are only allowed when the metrics system is being started. After correct initialization of the metrics system, duplicates are no longer allowed (monitoring flag is set to true). -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org