[ 
https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909810#comment-17909810
 ] 

Jason Pyeron commented on HDFS-17673:
-------------------------------------

also occurs in 3.4.1.

running {{hdfs balancer}} results in (some details have been redacted)

 
{noformat}
[hadoop@NAMENODE hadoop]$ hdfs balancer
2025-01-04 20:58:20,649 INFO impl.MetricsConfig: Loaded properties from 
hadoop-metrics2.properties
2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot 
period at 10 second(s).
2025-01-04 20:58:20,897 INFO impl.MetricsSystemImpl: Balancer metrics system 
started
2025-01-04 20:58:20,941 INFO balancer.Balancer: namenodes  = 
[hdfs://NAMENODE:9000]
2025-01-04 20:58:20,945 INFO balancer.Balancer: parameters = 
Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
#blockpools = 0, run during upgrade = false, sort top nodes = false, hot block 
time interval = 0]
2025-01-04 20:58:20,945 INFO balancer.Balancer: included nodes = []
2025-01-04 20:58:20,945 INFO balancer.Balancer: excluded nodes = []
2025-01-04 20:58:20,945 INFO balancer.Balancer: source nodes = []
Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved  NameNode
2025-01-04 20:58:20,950 INFO balancer.NameNodeConnector: getBlocks calls for 
hdfs://NAMENODE:9000 will be rate-limited to 20 per second
2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2025-01-04 20:58:22,762 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2025-01-04 20:58:22,762 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2025-01-04 20:58:22,763 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2025-01-04 20:58:22,763 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2025-01-04 20:58:22,771 INFO balancer.Balancer: dfs.blocksize = 134217728 
(default=134217728)
2025-01-04 20:58:22,813 INFO net.NetworkTopology: Adding a new node: 
/default-rack/datanode-194:9866
...REDACTED...
2025-01-04 20:58:22,817 INFO net.NetworkTopology: Adding a new node: 
/default-rack/datanode-167:9866
2025-01-04 20:58:22,821 INFO balancer.Balancer: 5 over-utilized: 
[datanode-54:9866:DISK, datanode-15:9866:DISK, datanode-152:9866:DISK, 
datanode-179:9866:DISK, datanode-13:9866:DISK]
2025-01-04 20:58:22,821 INFO balancer.Balancer: 4 underutilized: 
[datanode-187:9866:DISK, datanode-236:9866:DISK, datanode-180:9866:DISK, 
datanode-179:9866:DISK]
2025-01-04 20:58:22,822 INFO balancer.Balancer: Need to move 167.84 GB to make 
the cluster balanced.
2025-01-04 20:58:22,840 INFO balancer.Balancer: chooseStorageGroups for 
SAME_RACK: overUtilized => underUtilized
2025-01-04 20:58:22,841 INFO balancer.Balancer: Decided to move 10 GB bytes 
from datanode-54:9866:DISK to datanode-187:9866:DISK
2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes 
from datanode-15:9866:DISK to datanode-236:9866:DISK
2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes 
from datanode-152:9866:DISK to datanode-180:9866:DISK
2025-01-04 20:58:22,842 INFO balancer.Balancer: Decided to move 10 GB bytes 
from datanode-179:9866:DISK to datanode-179:9866:DISK
2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for 
SAME_RACK: overUtilized => belowAvgUtilized
2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for 
SAME_RACK: underUtilized => aboveAvgUtilized
2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for 
ANY_OTHER: overUtilized => underUtilized
2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for 
ANY_OTHER: overUtilized => belowAvgUtilized
2025-01-04 20:58:22,842 INFO balancer.Balancer: chooseStorageGroups for 
ANY_OTHER: underUtilized => aboveAvgUtilized
2025-01-04 20:58:22,842 INFO balancer.Balancer: Will move 40 GB  in this 
iteration for NameNodeConnector[namenodeUri=hdfs://NAMENODE:9000, 
bpid=BP-391587423-datanode-118-1666125691850]
2025-01-04 20:58:22,842 INFO balancer.Balancer: Total target DataNodes in this 
iteration: 4
2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Balancer concurrent 
dispatcher threads = 4
2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Limiting threads per target 
to the specified max.
2025-01-04 20:58:22,843 INFO balancer.Dispatcher: Allocating 100 threads per 
target.
2025-01-04 20:58:22,891 INFO balancer.Dispatcher: Start moving 
blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving 
blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving 
blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,894 INFO balancer.Dispatcher: Start moving 
blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving 
blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving 
blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,893 INFO balancer.Dispatcher: Start moving 
blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,892 INFO balancer.Dispatcher: Start moving 
blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,896 INFO balancer.Dispatcher: Start moving 
blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving 
blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,912 INFO balancer.Dispatcher: Start moving 
blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving 
blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving 
blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving 
blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,911 INFO balancer.Dispatcher: Start moving 
blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving 
blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving 
blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,910 INFO balancer.Dispatcher: Start moving 
blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,909 INFO balancer.Dispatcher: Start moving 
blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:22,900 INFO balancer.Dispatcher: Start moving 
blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:34,187 INFO balancer.Dispatcher: Successfully moved 
blk_1073868156_127340 with size=52735307 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:35,134 INFO balancer.Dispatcher: Successfully moved 
blk_1073868014_127198 with size=59247100 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:35,423 INFO balancer.Dispatcher: Successfully moved 
blk_1073868125_127309 with size=60493448 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:35,504 INFO balancer.Dispatcher: Successfully moved 
blk_1073868068_127252 with size=61202594 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:35,571 INFO balancer.Dispatcher: Successfully moved 
blk_1073868124_127308 with size=61242733 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:35,588 INFO balancer.Dispatcher: Successfully moved 
blk_1073868152_127336 with size=61938955 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:44,991 INFO balancer.Dispatcher: Successfully moved 
blk_1073868031_127215 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:44,994 INFO balancer.Dispatcher: Successfully moved 
blk_1073868003_127187 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,007 INFO balancer.Dispatcher: Successfully moved 
blk_1073868056_127240 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,022 INFO balancer.Dispatcher: Successfully moved 
blk_1073868169_127353 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,109 INFO balancer.Dispatcher: Successfully moved 
blk_1073868187_127371 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,120 INFO balancer.Dispatcher: Successfully moved 
blk_1073868121_127305 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,133 INFO balancer.Dispatcher: Successfully moved 
blk_1073868005_127189 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,140 INFO balancer.Dispatcher: Successfully moved 
blk_1073868050_127234 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,149 INFO balancer.Dispatcher: Successfully moved 
blk_1073868085_127269 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,166 INFO balancer.Dispatcher: Successfully moved 
blk_1073868022_127206 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,180 INFO balancer.Dispatcher: Successfully moved 
blk_1073868132_127316 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,182 INFO balancer.Dispatcher: Successfully moved 
blk_1073868138_127322 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,183 INFO balancer.Dispatcher: Successfully moved 
blk_1073868104_127288 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:45,189 INFO balancer.Dispatcher: Successfully moved 
blk_1073868147_127331 with size=134217728 from datanode-152:9866:DISK to 
datanode-180:9866:DISK through datanode-152:9866
2025-01-04 20:58:46,033 INFO balancer.Dispatcher: Total bytes (blocks) moved in 
this iteration 2.08 GB (20)
Jan 4, 2025, 8:58:46 PM           0              2.08 GB           167.84 GB    
          40 GB                 20  hdfs://NAMENODE:9000
2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.namenode.get-blocks.max-qps 
= 20 (default=20)
2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
5400000 (default=5400000)
2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.moverThreads = 
1000 (default=1000)
2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.dispatcherThreads 
= 200 (default=200)
2025-01-04 20:58:55,054 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
2147483648 (default=2147483648)
2025-01-04 20:58:55,054 INFO balancer.Balancer: 
dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
2025-01-04 20:58:55,055 INFO balancer.Balancer: 
dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
2025-01-04 20:58:55,055 INFO balancer.Balancer: 
dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 
10737418240 (default=10737418240)
2025-01-04 20:58:55,055 INFO balancer.Balancer: dfs.blocksize = 134217728 
(default=134217728)
Jan 4, 2025, 8:58:55 PM  Balancing took 35.618 seconds
2025-01-04 20:58:55,732 ERROR balancer.Balancer: Exiting balancer due an 
exception
org.apache.hadoop.metrics2.MetricsException: Metrics source 
Balancer-BP-391587423-datanode-118-1666125691850 already exists!
        at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
        at 
org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
        at 
org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
        at 
org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
        at 
org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
        at 
org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
        at 
org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:868)
        at 
org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at 
org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) 
{noformat}
 

 

> Unable to run HDFS balancer as a service
> ----------------------------------------
>
>                 Key: HDFS-17673
>                 URL: https://issues.apache.org/jira/browse/HDFS-17673
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer &amp; mover
>    Affects Versions: 3.4.0
>            Reporter: Dirk Daems
>            Priority: Major
>
> When running the HDFS balancer as a service using the following command
>  
> {code:java}
> hdfs balancer -asService {code}
> the first balancing round succeeds:
> {code:java}
> $ hdfs balancer -asService
> 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
> running service
> 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
> hadoop-metrics2.properties
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric 
> snapshot period at 10 second(s).
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
> started
> 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 14:58:15,702 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.15:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.16:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.14:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.13:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.248:10004
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
> Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B  
>               0 B                  0  hdfs://prd
> The cluster is balanced. Exiting...
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round{code}
> but subsequent rounds always fail with:
> {code:java}
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
> balance work. Already tried 1 times
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at 
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>     at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round {code}
> and finally the process stops after retrying 5 times:
> {code:java}
> 2024-11-25 15:28:17,816 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 15:28:17,853 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 15:28:17,853 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 15:28:17,940 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> Nov 25, 2024, 3:28:17 PM Balancing took 30.0581 minutes
> 2024-11-25 15:28:17,948 ERROR balancer.Balancer: Exiting balancer due an 
> exception
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at 
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>     at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> {code}
> Looking at the HDFS balancer code, I don't see a way to prevent this error, 
> which is strange. A metrics source is created, with name "Balancer-" + 
> blockpoolID, where blockpoolID is 
> 'BP-1957577775-192.168.146.236-1720512277934' in our case.
> When in servicemode, new balancers (and thus metrics sources) will be created 
> in the doBalance method, while the service is running. 
> When looking at the metrics implementation, duplicates are only allowed when 
> the metrics system is being started. After correct initialization of the 
> metrics system, duplicates are no longer allowed (monitoring flag is set to 
> true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to