[ https://issues.apache.org/jira/browse/HDFS-17673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909815#comment-17909815 ]
Jason Pyeron commented on HDFS-17673: ------------------------------------- Ok, the issue was actually HDFS-8824 for the "No block has been moved for 5 iterations. Exiting..." Better solution was to add {{DefaultMetricsSystem.instance().unregisterSource(metrics.getName());}} to the end of {{org.apache.hadoop.hdfs.server.balancer.Balancer.resetData().}} will send a pull request > Unable to run HDFS balancer as a service > ---------------------------------------- > > Key: HDFS-17673 > URL: https://issues.apache.org/jira/browse/HDFS-17673 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover > Affects Versions: 3.4.0 > Reporter: Dirk Daems > Priority: Major > > When running the HDFS balancer as a service using the following command > > {code:java} > hdfs balancer -asService {code} > the first balancing round succeeds: > {code:java} > $ hdfs balancer -asService > 2024-11-25 14:58:14,712 INFO balancer.Balancer: Balancer will run as a long > running service > 2024-11-25 14:58:14,764 INFO impl.MetricsConfig: Loaded properties from > hadoop-metrics2.properties > 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Scheduled Metric > snapshot period at 10 second(s). > 2024-11-25 14:58:14,824 INFO impl.MetricsSystemImpl: Balancer metrics system > started > 2024-11-25 14:58:14,840 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: included nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: source nodes = [] > 2024-11-25 14:58:14,841 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 14:58:14,955 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 14:58:14,956 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 14:58:15,499 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 14:58:15,501 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 14:58:15,501 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 14:58:15,702 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 14:58:15,703 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 14:58:15,705 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 14:58:15,708 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 14:58:15,732 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.15:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.16:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.14:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.13:10004 > 2024-11-25 14:58:15,733 INFO net.NetworkTopology: Adding a new node: > /LeibnizRack17/192.168.146.248:10004 > 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 over-utilized: [] > 2024-11-25 14:58:15,734 INFO balancer.Balancer: 0 underutilized: [] > Nov 25, 2024, 2:58:15 PM 0 0 B 0 B > 0 B 0 hdfs://prd > The cluster is balanced. Exiting... > 2024-11-25 14:58:15,751 INFO balancer.Balancer: Balance succeed! > 2024-11-25 14:58:15,751 INFO balancer.Balancer: Finished one round, will wait > for 5.0 minutes for next round{code} > but subsequent rounds always fail with: > {code:java} > 2024-11-25 15:03:15,751 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: included nodes = [] > 2024-11-25 15:03:15,751 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 15:03:15,752 INFO balancer.Balancer: source nodes = [] > 2024-11-25 15:03:15,752 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 15:03:15,792 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 15:03:15,792 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 15:03:15,909 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:03:15,909 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 15:03:15,934 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 15:03:15,935 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 15:03:15,935 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 15:03:15,936 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 15:03:15,936 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 15:03:15,945 WARN balancer.Balancer: Encounter exception while do > balance work. Already tried 1 times > org.apache.hadoop.metrics2.MetricsException: Metrics source > Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) > at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) > 2024-11-25 15:03:15,946 INFO balancer.Balancer: Finished one round, will wait > for 5.0 minutes for next round {code} > and finally the process stops after retrying 5 times: > {code:java} > 2024-11-25 15:28:17,816 INFO balancer.Balancer: namenodes = [hdfs://prd] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: parameters = > Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle > iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, > #blockpools = 0, run during upgrade = false, sort top nodes = false, hot > block time interval = 0] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: included nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: excluded nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: source nodes = [] > 2024-11-25 15:28:17,817 INFO balancer.Balancer: Keytab is configured, will > login using keytab. > 2024-11-25 15:28:17,853 INFO security.UserGroupInformation: Login successful > for user h...@vgt.vito.be using keytab file hdfs.headless.keytab. Keytab auto > renewal enabled : false > Time Stamp Iteration# Bytes Already Moved Bytes Left To Move > Bytes Being Moved NameNode > 2024-11-25 15:28:17,853 INFO balancer.NameNodeConnector: getBlocks calls for > hdfs://prd will be rate-limited to 20 per second > 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Block token params received > from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec > 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Block token key > range: [0, 2147483647) > 2024-11-25 15:28:17,917 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > 2024-11-25 15:28:17,917 INFO balancer.KeyManager: Update block keys every > 2hrs, 30mins, 0sec > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.namenode.get-blocks.max-qps = 20 (default=20) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.movedWinWidth = > 5400000 (default=5400000) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.moverThreads = > 1000 (default=1000) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.balancer.dispatcherThreads = 200 (default=200) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: dfs.balancer.getBlocks.size = > 2147483648 (default=2147483648) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.datanode.balance.max.concurrent.moves = 100 (default=100) > 2024-11-25 15:28:17,938 INFO balancer.Balancer: > dfs.datanode.balance.bandwidthPerSec = 104857600 (default=104857600) > 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.balancer.max-size-to-move > = 10737418240 (default=10737418240) > 2024-11-25 15:28:17,940 INFO balancer.Balancer: dfs.blocksize = 268435456 > (default=134217728) > 2024-11-25 15:28:17,940 INFO block.BlockTokenSecretManager: Setting block > keys. BlockPool = BP-1957577775-192.168.146.236-1720512277934 . > Nov 25, 2024, 3:28:17 PM Balancing took 30.0581 minutes > 2024-11-25 15:28:17,948 ERROR balancer.Balancer: Exiting balancer due an > exception > org.apache.hadoop.metrics2.MetricsException: Metrics source > Balancer-BP-1957577775-192.168.146.236-1720512277934 already exists! > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152) > at > org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125) > at > org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229) > at > org.apache.hadoop.hdfs.server.balancer.BalancerMetrics.create(BalancerMetrics.java:52) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.<init>(Balancer.java:362) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.doBalance(Balancer.java:824) > at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:887) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:975) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:1133) > {code} > Looking at the HDFS balancer code, I don't see a way to prevent this error, > which is strange. A metrics source is created, with name "Balancer-" + > blockpoolID, where blockpoolID is > 'BP-1957577775-192.168.146.236-1720512277934' in our case. > When in servicemode, new balancers (and thus metrics sources) will be created > in the doBalance method, while the service is running. > When looking at the metrics implementation, duplicates are only allowed when > the metrics system is being started. After correct initialization of the > metrics system, duplicates are no longer allowed (monitoring flag is set to > true). -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org