[ https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909810#comment-17909810 ]
Jason Pyeron edited comment on HDFS-17673 at 1/5/25 2:36 AM: ------------------------------------------------------------- also occurs in 3.4.1. running {{hdfs balancer}} results in (some details have been redacted) {noformat} [hadoop@NAMENODE hadoop]$ hdfs balancer 2025-01-04 20:58:20,649 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Balancer metrics system started 2025-01-04 20:58:20,941 INFO balancer.Balancer: namenodes = [hdfs://NAMENODE:9000] 2025-01-04 20:58:20,945 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false, sort top nodes = false, hot block time interval = 0] 2025-01-04 20:58:20,945 INFO balancer.Balancer: included nodes = [] 2025-01-04 20:58:20,945 INFO balancer.Balancer: excluded nodes = [] 2025-01-04 20:58:20,945 INFO balancer.Balancer: source nodes = [] Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved NameNode 2025-01-04 20:58:20,950 INFO balancer.NameNodeConnector: getBlocks calls for hdfs://NAMENODE:9000 will be rate-limited to 20 per second 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2025-01-04 20:58:22,763 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2025-01-04 20:58:22,763 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728) 2025-01-04 20:58:22,813 INFO net.NetworkTopology: Adding a new node: /default-rack/datanode-194:9866 ...REDACTED... 2025-01-04 20:58:22,817 INFO net.NetworkTopology: Adding a new node: /default-rack/datanode-167:9866 2025-01-04 20:58:22,821 INFO balancer.Balancer: 5 over-utilized: [datanode-54:9866:DISK, datanode-15:9866:DISK, datanode-152:9866:DISK, datanode-179:9866:DISK, datanode-13:9866:DISK] 2025-01-04 20:58:22,821 INFO balancer.Balancer: 4 underutilized: [datanode-187:9866:DISK, datanode-236:9866:DISK, datanode-180:9866:DISK, datanode-179:9866:DISK] 2025-01-04 20:58:22,822 INFO balancer.Balancer: Need to move 167.84 GB to make the cluster balanced. 2025-01-04 20:58:22,840 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: overUtilized => underUtilized 2025-01-04 20:58:22,841 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-54:9866:DISK to datanode-187:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-15:9866:DISK to datanode-236:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-152:9866:DISK to datanode-180:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-179:9866:DISK to datanode-179:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: Will move 40 GB in this iteration for NameNodeConnector[namenodeUri=hdfs://NAMENODE:9000, bpid=BP-391587423-datanode-118-1666125691850] 2025-01-04 20:58:22,842 INFO balancer.Balancer: Total target DataNodes in this iteration: 4 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Balancer concurrent dispatcher threads = 4 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Limiting threads per target to the specified max. 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Allocating 100 threads per target. 2025-01-04 20:58:22,891 INFO balancer.Dispatcher: Start moving blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,892 INFO balancer.Dispatcher: Start moving blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,896 INFO balancer.Dispatcher: Start moving blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,909 INFO balancer.Dispatcher: Start moving blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,900 INFO balancer.Dispatcher: Start moving blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:34,187 INFO balancer.Dispatcher: Successfully moved blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,134 INFO balancer.Dispatcher: Successfully moved blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,423 INFO balancer.Dispatcher: Successfully moved blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,504 INFO balancer.Dispatcher: Successfully moved blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,571 INFO balancer.Dispatcher: Successfully moved blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,588 INFO balancer.Dispatcher: Successfully moved blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:44,991 INFO balancer.Dispatcher: Successfully moved blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:44,994 INFO balancer.Dispatcher: Successfully moved blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,007 INFO balancer.Dispatcher: Successfully moved blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,022 INFO balancer.Dispatcher: Successfully moved blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,109 INFO balancer.Dispatcher: Successfully moved blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,120 INFO balancer.Dispatcher: Successfully moved blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,133 INFO balancer.Dispatcher: Successfully moved blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,140 INFO balancer.Dispatcher: Successfully moved blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,149 INFO balancer.Dispatcher: Successfully moved blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,166 INFO balancer.Dispatcher: Successfully moved blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,180 INFO balancer.Dispatcher: Successfully moved blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,182 INFO balancer.Dispatcher: Successfully moved blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,183 INFO balancer.Dispatcher: Successfully moved blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,189 INFO balancer.Dispatcher: Successfully moved blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:46,033 INFO balancer.Dispatcher: Total bytes (blocks) moved in this iteration 2.08 GB (20) Jan 4, 2025, 8:58:46 PM 0 2.08 GB 167.84 GB 40 GB 20 hdfs://NAMENODE:9000 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728) Jan 4, 2025, 8:58:55 PM Balancing took 35.618 seconds 2025-01-04 20:58:55,732 ERROR balancer.Balancer: Exiting balancer due an exception org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-391587423-datanode-118-1666125691850 already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:868) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) {noformat} The "duplicate" source name is based on a decommissioned datanode, which is still listed in etc/hadoop/dfs.hosts.exclude.conf . Tracking internally as [PDINC-11774|https://2gb.intranet.pdinc.us/pdinc/dte/prj/openproject/work_packages/11774/activity]. was (Author: jpyeron): also occurs in 3.4.1. running {{hdfs balancer}} results in (some details have been redacted) {noformat} [hadoop@NAMENODE hadoop]$ hdfs balancer 2025-01-04 20:58:20,649 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties 2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s). 2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Balancer metrics system started 2025-01-04 20:58:20,941 INFO balancer.Balancer: namenodes = [hdfs://NAMENODE:9000] 2025-01-04 20:58:20,945 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false, sort top nodes = false, hot block time interval = 0] 2025-01-04 20:58:20,945 INFO balancer.Balancer: included nodes = [] 2025-01-04 20:58:20,945 INFO balancer.Balancer: excluded nodes = [] 2025-01-04 20:58:20,945 INFO balancer.Balancer: source nodes = [] Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved NameNode 2025-01-04 20:58:20,950 INFO balancer.NameNodeConnector: getBlocks calls for hdfs://NAMENODE:9000 will be rate-limited to 20 per second 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2025-01-04 20:58:22,763 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2025-01-04 20:58:22,763 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728) 2025-01-04 20:58:22,813 INFO net.NetworkTopology: Adding a new node: /default-rack/datanode-194:9866 ...REDACTED... 2025-01-04 20:58:22,817 INFO net.NetworkTopology: Adding a new node: /default-rack/datanode-167:9866 2025-01-04 20:58:22,821 INFO balancer.Balancer: 5 over-utilized: [datanode-54:9866:DISK, datanode-15:9866:DISK, datanode-152:9866:DISK, datanode-179:9866:DISK, datanode-13:9866:DISK] 2025-01-04 20:58:22,821 INFO balancer.Balancer: 4 underutilized: [datanode-187:9866:DISK, datanode-236:9866:DISK, datanode-180:9866:DISK, datanode-179:9866:DISK] 2025-01-04 20:58:22,822 INFO balancer.Balancer: Need to move 167.84 GB to make the cluster balanced. 2025-01-04 20:58:22,840 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: overUtilized => underUtilized 2025-01-04 20:58:22,841 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-54:9866:DISK to datanode-187:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-15:9866:DISK to datanode-236:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-152:9866:DISK to datanode-180:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes from datanode-179:9866:DISK to datanode-179:9866:DISK 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized 2025-01-04 20:58:22,842 INFO balancer.Balancer: Will move 40 GB in this iteration for NameNodeConnector[namenodeUri=hdfs://NAMENODE:9000, bpid=BP-391587423-datanode-118-1666125691850] 2025-01-04 20:58:22,842 INFO balancer.Balancer: Total target DataNodes in this iteration: 4 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Balancer concurrent dispatcher threads = 4 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Limiting threads per target to the specified max. 2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Allocating 100 threads per target. 2025-01-04 20:58:22,891 INFO balancer.Dispatcher: Start moving blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,892 INFO balancer.Dispatcher: Start moving blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,896 INFO balancer.Dispatcher: Start moving blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,909 INFO balancer.Dispatcher: Start moving blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:22,900 INFO balancer.Dispatcher: Start moving blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:34,187 INFO balancer.Dispatcher: Successfully moved blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,134 INFO balancer.Dispatcher: Successfully moved blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,423 INFO balancer.Dispatcher: Successfully moved blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,504 INFO balancer.Dispatcher: Successfully moved blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,571 INFO balancer.Dispatcher: Successfully moved blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:35,588 INFO balancer.Dispatcher: Successfully moved blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:44,991 INFO balancer.Dispatcher: Successfully moved blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:44,994 INFO balancer.Dispatcher: Successfully moved blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,007 INFO balancer.Dispatcher: Successfully moved blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,022 INFO balancer.Dispatcher: Successfully moved blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,109 INFO balancer.Dispatcher: Successfully moved blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,120 INFO balancer.Dispatcher: Successfully moved blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,133 INFO balancer.Dispatcher: Successfully moved blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,140 INFO balancer.Dispatcher: Successfully moved blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,149 INFO balancer.Dispatcher: Successfully moved blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,166 INFO balancer.Dispatcher: Successfully moved blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,180 INFO balancer.Dispatcher: Successfully moved blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,182 INFO balancer.Dispatcher: Successfully moved blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,183 INFO balancer.Dispatcher: Successfully moved blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:45,189 INFO balancer.Dispatcher: Successfully moved blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to datanode-180:9866:DISK through datanode-152:9866 2025-01-04 20:58:46,033 INFO balancer.Dispatcher: Total bytes (blocks) moved in this iteration 2.08 GB (20) Jan 4, 2025, 8:58:46 PM 0 2.08 GB 167.84 GB 40 GB 20 hdfs://NAMENODE:9000 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps = 20 (default=20) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648) 2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 100 (default=100) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240) 2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728) Jan 4, 2025, 8:58:55 PM Balancing took 35.618 seconds 2025-01-04 20:58:55,732 ERROR balancer.Balancer: Exiting balancer due an exception org.apache.hadoop.metrics2.MetricsException: Metrics source Balancer-BP-391587423-datanode-118-1666125691850 already exists! at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) at org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) at org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) at org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:868) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) {noformat} Tracking internally as [PDINC-11774|https://2gb.intranet.pdinc.us/pdinc/dte/prj/openproject/work_packages/11774/activity]. > Unable to run HDFS balancer as a service > ---------------------------------------- > > Key: HDFS-17673 > URL: https://issues.apache.org/jira/browse/HDFS-17673 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover > Affects Versions: 3.4.0 > Reporter: Dirk Daems > Priority: Major > > When running the HDFS balancer as a service using the following command > > {code:java} > hdfs balancer -asService {code} > the first balancing round succeeds: > {code:java} > $ hdfs balancer -asService > 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long > running service > 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from > hadoop-metrics2.properties > 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric > snapshot period at 10 second(s). > 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system > started > 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 14:58:15,702 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.15:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.16:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.14:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.13:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.248:10004 > 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: [] > 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: [] > Nov 25, 2024, 2:58:15 PM 0 0 B 0 B > 0 B 0 hdfs://prd > The cluster is balanced. Exiting... > 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed! > 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait > for 5.0 minutes for next round{code} > but subsequent rounds always fail with: > {code:java} > 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = [] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = [] > 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 15:03:15,935 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 15:03:15,935 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do > balance work. Already tried 1 times > org.apache.hadoop.metrics2.MetricsException: Metrics source > Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) > at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) > 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait > for 5.0 minutes for next round {code} > and finally the process stops after retrying 5 times: > {code:java} > 2024-11-25 15:28:17,816 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: included nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: source nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 15:28:17,853 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 15:28:17,853 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 15:28:17,940 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > Nov 25, 2024, 3:28:17 PM Balancing took 30.0581 minutes > 2024-11-25 15:28:17,948 ERROR balancer.Balancer: Exiting balancer due an > exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) > at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) > {code} > Looking at the HDFS balancer code, I don't see a way to prevent this error, > which is strange. A metrics source is created, with name "Balancer-" + > blockpoolID, where blockpoolID is > 'BP-1957577775-192.168.146.236-1720512277934' in our case. > When in servicemode, new balancers (and thus metrics sources) will be created > in the doBalance method, while the service is running. > When looking at the metrics implementation, duplicates are only allowed when > the metrics system is being started. After correct initialization of the > metrics system, duplicates are no longer allowed (monitoring flag is set to > true). -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org