[
https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909815#comment-17909815
]
Jason Pyeron commented on HDFS-17673:
-------------------------------------
Ok, the issue was actually HDFS-8824 for the "No block has been moved for 5
iterations. Exiting..."
Better solution was to add
{{DefaultMetricsSystem.instance().unregisterSource(metrics.getName());}} to the
end of {{org.apache.hadoop.hdfs.server.balancer.Balancer.resetData().}}
will send a pull request
> Unable to run HDFS balancer as a service
> ----------------------------------------
>
> Key: HDFS-17673
> URL: https://issues.apache.org/jira/browse/HDFS-17673
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Affects Versions: 3.4.0
> Reporter: Dirk Daems
> Priority: Major
>
> When running the HDFS balancer as a service using the following command
>
> {code:java}
> hdfs balancer -asService {code}
> the first balancing round succeeds:
> {code:java}
> $ hdfs balancer -asService
> 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long
> running service
> 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from
> hadoop-metrics2.properties
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric
> snapshot period at 10 second(s).
> 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system
> started
> 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes = [hdfs://prd]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters =
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0,
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot
> block time interval = 0]
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = []
> 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will
> login using keytab.
> 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto
> renewal enabled : false
> Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
> Bytes Being Moved NameNode
> 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key
> range: [0, 2147483647)
> 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every
> 2hrs, 30mins, 0sec
> 2024-11-25 14:58:15,702 INFO balancer.Balancer:
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth =
> 5400000 (default=5400000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads =
> 1000 (default=1000)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer:
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size =
> 2147483648 (default=2147483648)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer:
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer:
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 14:58:15,703 INFO balancer.Balancer:
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move
> = 10737418240 (default=10737418240)
> 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456
> (default=134217728)
> 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node:
> /LeibnizRack17/192.168.146.15:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
> /LeibnizRack17/192.168.146.16:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
> /LeibnizRack17/192.168.146.14:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
> /LeibnizRack17/192.168.146.13:10004
> 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node:
> /LeibnizRack17/192.168.146.248:10004
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: []
> 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: []
> Nov 25, 2024, 2:58:15 PM 0 0 B 0 B
> 0 B 0 hdfs://prd
> The cluster is balanced. Exiting...
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed!
> 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait
> for 5.0 minutes for next round{code}
> but subsequent rounds always fail with:
> {code:java}
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes = [hdfs://prd]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters =
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0,
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot
> block time interval = 0]
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will
> login using keytab.
> 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto
> renewal enabled : false
> Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
> Bytes Being Moved NameNode
> 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key
> range: [0, 2147483647)
> 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every
> 2hrs, 30mins, 0sec
> 2024-11-25 15:03:15,934 INFO balancer.Balancer:
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth =
> 5400000 (default=5400000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads =
> 1000 (default=1000)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer:
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size =
> 2147483648 (default=2147483648)
> 2024-11-25 15:03:15,934 INFO balancer.Balancer:
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer:
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:03:15,935 INFO balancer.Balancer:
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move
> = 10737418240 (default=10737418240)
> 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456
> (default=134217728)
> 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do
> balance work. Already tried 1 times
> org.apache.hadoop.metrics2.MetricsException: Metrics source
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
> at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
> at
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
> at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait
> for 5.0 minutes for next round {code}
> and finally the process stops after retrying 5 times:
> {code:java}
> 2024-11-25 15:28:17,816 INFO balancer.Balancer: namenodes = [hdfs://prd]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: parameters =
> Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle
> iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0,
> #blockpools = 0, run during upgrade = false, sort top nodes = false, hot
> block time interval = 0]
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: included nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: excluded nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: source nodes = []
> 2024-11-25 15:28:17,817 INFO balancer.Balancer: Keytab is configured, will
> login using keytab.
> 2024-11-25 15:28:17,853 INFO security.UserGroupInformation: Login successful
> for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto
> renewal enabled : false
> Time Stamp Iteration# Bytes Already Moved Bytes Left To Move
> Bytes Being Moved NameNode
> 2024-11-25 15:28:17,853 INFO balancer.NameNodeConnector: getBlocks calls for
> hdfs://prd will be rate-limited to 20 per second
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Block token params received
> from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Block token key
> range: [0, 2147483647)
> 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Update block keys every
> 2hrs, 30mins, 0sec
> 2024-11-25 15:28:17,938 INFO balancer.Balancer:
> dfs.namenode.get-blocks.max-qps = 20 (default=20)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.movedWinWidth =
> 5400000 (default=5400000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.moverThreads =
> 1000 (default=1000)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer:
> dfs.balancer.dispatcherThreads = 200 (default=200)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.getBlocks.size =
> 2147483648 (default=2147483648)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer:
> dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer:
> dfs.datanode.balance.max.concurrent.moves = 100 (default=100)
> 2024-11-25 15:28:17,938 INFO balancer.Balancer:
> dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.balancer.max-size-to-move
> = 10737418240 (default=10737418240)
> 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.blocksize = 268435456
> (default=134217728)
> 2024-11-25 15:28:17,940 INFO block.BlockTokenSecretManager: Setting block
> keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 .
> Nov 25, 2024, 3:28:17 PM Balancing took 30.0581 minutes
> 2024-11-25 15:28:17,948 ERROR balancer.Balancer: Exiting balancer due an
> exception
> org.apache.hadoop.metrics2.MetricsException: Metrics source
> Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists!
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
> at
> org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
> at
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
> at
> org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824)
> at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
> at
> org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133)
> {code}
> Looking at the HDFS balancer code, I don't see a way to prevent this error,
> which is strange. A metrics source is created, with name "Balancer-" +
> blockpoolID, where blockpoolID is
> 'BP-1957577775-192.168.146.236-1720512277934' in our case.
> When in servicemode, new balancers (and thus metrics sources) will be created
> in the doBalance method, while the service is running.
> When looking at the metrics implementation, duplicates are only allowed when
> the metrics system is being started. After correct initialization of the
> metrics system, duplicates are no longer allowed (monitoring flag is set to
> true).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org