[ https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649458#comment-17649458 ]
ASF GitHub Bot commented on HDFS-16867:
---------------------------------------

Jing9 commented on code in PR #5203:
URL: https://github.com/apache/hadoop/pull/5203#discussion_r1052583684

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java:
##########
@@ -161,6 +162,7 @@ public static void checkOtherInstanceRunning(boolean toCheck) {
   private final Path idPath;
   private OutputStream out;
   private final List<Path> targetPaths;
+  private final MoverMetrics moverMetrics;

Review Comment:
   NameNodeConnector is also used by Balancer, while MoverMetrics is only used by Mover. So I am not sure that placing MoverMetrics directly in NameNodeConnector is appropriate from a semantic perspective.

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java:
##########
@@ -160,7 +160,7 @@ Collections.<String> emptySet(), movedWinWidth, moverThreads, 0,
         BlockStoragePolicySuite.ID_BIT_LENGTH];
     this.excludedPinnedBlocks = excludedPinnedBlocks;
     this.nnc = nnc;
-    this.metrics = MoverMetrics.create(this);

Review Comment:
   If the main issue is the potential naming conflict caused by multiple Mover instances, can we track the existing MoverMetrics instances and their NNC mappings at the class level (i.e. through a static field) to avoid the duplication?

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java:
##########
@@ -160,7 +160,7 @@ Collections.<String> emptySet(), movedWinWidth, moverThreads, 0,
         BlockStoragePolicySuite.ID_BIT_LENGTH];
     this.excludedPinnedBlocks = excludedPinnedBlocks;
     this.nnc = nnc;
-    this.metrics = MoverMetrics.create(this);
+    this.metrics = nnc.getMoverMetrics();

Review Comment:
   We also need to add some unit tests (UTs) that reproduce the issue (without your fix) and validate the fix.

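To illustrate the second review comment, here is a minimal sketch of tracking metrics instances at the class level through a static map. It is not the Hadoop code: SketchMoverMetrics, getOrCreate, remove, and the blocksScheduled counter are hypothetical stand-ins, and keying by block pool ID is an assumption taken from the "Mover-${BlockpoolID}" source name in the reported log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for MoverMetrics; this is not the Hadoop class.
final class SketchMoverMetrics {
  // Class-level cache: at most one metrics instance per block pool ID.
  private static final Map<String, SketchMoverMetrics> INSTANCES =
      new ConcurrentHashMap<>();

  private final String sourceName;
  private final AtomicLong blocksScheduled = new AtomicLong();

  private SketchMoverMetrics(String sourceName) {
    this.sourceName = sourceName;
  }

  // Returns the cached instance for this block pool, creating it only once.
  static SketchMoverMetrics getOrCreate(String blockpoolId) {
    return INSTANCES.computeIfAbsent(blockpoolId,
        id -> new SketchMoverMetrics("Mover-" + id));
  }

  // Drops the cached instance, for example when the connector is closed.
  static void remove(String blockpoolId) {
    INSTANCES.remove(blockpoolId);
  }

  String getSourceName() {
    return sourceName;
  }

  void incrBlocksScheduled() {
    blocksScheduled.incrementAndGet();
  }

  long getBlocksScheduled() {
    return blocksScheduled.get();
  }
}
```

With such a cache, constructing a second mover for the same block pool would reuse the existing metrics object rather than registering a second source under the same name. Whether the cache belongs in MoverMetrics or elsewhere is a design question for the PR.
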
> Exiting Mover due to an exception in MoverMetrics.create
> --------------------------------------------------------
>
>                 Key: HDFS-16867
>                 URL: https://issues.apache.org/jira/browse/HDFS-16867
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZhiWei Shi
>            Assignee: ZhiWei Shi
>            Priority: Major
>              Labels: pull-request-available
>
> After the Mover process has been running for a period of time, it exits
> unexpectedly and an error is reported in the log:
> {code:java}
> [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p /test-mover-jira9534 > mover.log.jira9534.20221209.2 &
> [hdfs@{hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2
> ...
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to ${ip1}:800:ARCHIVE through ${ip2}:800
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to ${ip1}:800:ARCHIVE through ${ip2}:800
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics system...
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped.
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown complete.
> Dec 9, 2022, 2:22:42 PM Mover took 13mins, 19sec
> 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception
> org.apache.hadoop.metrics2.MetricsException: Metrics source Mover-${BlockpoolID} already exists!
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>         at org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49)
>         at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162)
>         at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684)
>         at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
>         at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908)
> {code}
> 1. "final ExitStatus r = m.run()" returns after only one replica move has been scheduled.
> 2. In that case "r == ExitStatus.IN_PROGRESS", so iter.remove() is not run.
> 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" are then executed multiple
> times for the same nnc, which leads to the error.
> {code:java}
> //Mover.java
> for (final StorageType t : diff.existing) {
>   for (final MLocation ml : locations) {
>     final Source source = storages.getSource(ml);
>     if (ml.storageType == t && source != null) {
>       // try to schedule one replica move.
>       if (scheduleMoveReplica(db, source, diff.expected)) { // 1. returns after only one replica move has been scheduled
>         return true;
>       }
>     }
>   }
> }
>
> while (connectors.size() > 0) {
>   Collections.shuffle(connectors);
>   Iterator<NameNodeConnector> iter = connectors.iterator();
>   while (iter.hasNext()) {
>     NameNodeConnector nnc = iter.next();
>     // 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" run
>     // multiple times for the same nnc, which leads to the error
>     final Mover m = new Mover(nnc, conf, retryCount,
>         excludedPinnedBlocks);
>     final ExitStatus r = m.run();
>     if (r == ExitStatus.SUCCESS) { // 2. when r == ExitStatus.IN_PROGRESS, iter.remove() is not run
>       IOUtils.cleanupWithLogger(LOG, nnc);
>       iter.remove();
>     }
> {code}
> Probably, we should initialize MoverMetrics when we initialize the nnc.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
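The failure mode described in steps 1 to 3 of the quoted issue can be reproduced in isolation. The sketch below is an assumption-laden model, not Hadoop code: SketchMetricsSystem, SketchMover, and SketchRetryLoop are hypothetical stand-ins, and IllegalStateException stands in for the MetricsException seen in the log. It only shows that building a new mover on every loop iteration for the same block pool fails on the second registration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a metrics system that rejects duplicate source names.
final class SketchMetricsSystem {
  private final Set<String> sources = new HashSet<>();

  void register(String name) {
    if (!sources.add(name)) {
      throw new IllegalStateException(
          "Metrics source " + name + " already exists!");
    }
  }
}

// Hypothetical stand-in for Mover: registers its metrics source in the constructor.
final class SketchMover {
  SketchMover(SketchMetricsSystem ms, String blockpoolId) {
    ms.register("Mover-" + blockpoolId);
  }
}

final class SketchRetryLoop {
  // Builds a new SketchMover per iteration for the same block pool, mirroring
  // the reported loop while run() keeps returning IN_PROGRESS. Returns how many
  // movers were constructed before registration failed, or `iterations` if
  // every construction succeeded.
  static int constructRepeatedly(SketchMetricsSystem ms, String blockpoolId,
      int iterations) {
    for (int i = 0; i < iterations; i++) {
      try {
        new SketchMover(ms, blockpoolId);
      } catch (IllegalStateException e) {
        return i;
      }
    }
    return iterations;
  }
}
```

A unit test along these lines, written against the real Mover and a mini cluster, could serve as the reproduction the reviewer asked for; the exact test setup is left to the PR.
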