[ 
https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909816#comment-17909816
 ] 

ASF GitHub Bot commented on HDFS-17673:
---------------------------------------

pdinc-oss opened a new pull request, #7270:
URL: https://github.com/apache/hadoop/pull/7270

   one liner
   
   using in production
   
   Apache committer agreement on file...
   
   Please feel free to be aggressive on feedback.




> Unable to run HDFS balancer as a service
> ----------------------------------------
>
>                 Key: HDFS-17673
>                 URL: https://issues.apache.org/jira/browse/HDFS-17673
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 3.4.0
>            Reporter: Dirk Daems
>            Priority: Major
>
> When running the HDFS balancer as a service using the following command
>  
> {code:java}
> hdfs balancer -asService {code}
> the first balancing round succeeds:
> {code:java}
> $ hdfs balancer -asService
> 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long 
> running service
> 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from 
> hadoop-metrics2.properties
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric 
> snapshot period at 10 second(s).
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system 
> started
> 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 14:58:15,702 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.15:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.16:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.14:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.13:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: 
> /LeibnizRack17/192.168.146.248:10004
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
> Nov 25, 2024, 2:58:15 PM          0                  0 B                 0 B  
>               0 B                  0  hdfs://prd
> The cluster is balanced. Exiting...
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round{code}
> but subsequent rounds always fail with:
> {code:java}
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do 
> balance work. Already tried 1 times
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at 
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>     at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait 
> for 5.0 minutes for next round {code}
> and finally the process stops after retrying 5 times:
> {code:java}
> 2024-11-25 15:28:17,816 INFO balancer.Balancer: namenodes  = [hdfs://prd]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: parameters = 
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle 
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, 
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot 
> block time interval = 0]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: Keytab is configured, will 
> login using keytab.
> 2024-11-25 15:28:17,853 INFO security.UserGroupInformation: Login successful 
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto 
> renewal enabled : false
> Time Stamp               Iteration#  Bytes Already Moved  Bytes Left To Move  
> Bytes Being Moved  NameNode
> 2024-11-25 15:28:17,853 INFO balancer.NameNodeConnector: getBlocks calls for 
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Block token params received 
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Block token key 
> range: [0, 2147483647)
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Update block keys every 
> 2hrs, 30mins, 0sec
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 
> 5400000 (default=5400000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.moverThreads = 
> 1000 (default=1000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 
> 2147483648 (default=2147483648)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: 
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.balancer.max-size-to-move 
> = 10737418240 (default=10737418240)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.blocksize = 268435456 
> (default=134217728)
> 2024-11-25 15:28:17,940 INFO block.BlockTokenSecretManager: Setting block 
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> Nov 25, 2024, 3:28:17 PM Balancing took 30.0581 minutes
> 2024-11-25 15:28:17,948 ERROR balancer.Balancer: Exiting balancer due an 
> exception
> org.apache.hadoop.metrics2.MetricsException: Metrics source 
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>     at 
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>     at 
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>     at 
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
>     at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>     at 
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> {code}
> Looking at the HDFS balancer code, I don't see a way to prevent this error, 
> which is strange. A metrics source is created, with name "Balancer-" + 
> blockpoolID, where blockpoolID is 
> 'BP-1957577775-192.168.146.236-1720512277934' in our case.
> When in servicemode, new balancers (and thus metrics sources) will be created 
> in the doBalance method, while the service is running. 
> When looking at the metrics implementation, duplicates are only allowed when 
> the metrics system is being started. After correct initialization of the 
> metrics system, duplicates are no longer allowed (monitoring flag is set to 
> true).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to