[jira] [Commented] (HDFS-16867) Exiting Mover due to an exception in MoverMetrics.create

ASF GitHub Bot (Jira) Mon, 19 Dec 2022 12:31:47 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-16867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17649458#comment-17649458
 ]


ASF GitHub Bot commented on HDFS-16867:
---------------------------------------

Jing9 commented on code in PR #5203:
URL: https://github.com/apache/hadoop/pull/5203#discussion_r1052583684


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/NameNodeConnector.java:
##########
@@ -161,6 +162,7 @@ public static void checkOtherInstanceRunning(boolean toCheck) {
   private final Path idPath;
   private OutputStream out;
   private final List<Path> targetPaths;
+  private final MoverMetrics moverMetrics;

Review Comment:
   NameNodeConnector is also used by Balancer, whereas MoverMetrics is used 
only by Mover. I'm not sure that placing MoverMetrics directly in 
NameNodeConnector is a good fit semantically.



##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java:
##########
@@ -160,7 +160,7 @@ Collections.<String> emptySet(), movedWinWidth, moverThreads, 0,
         BlockStoragePolicySuite.ID_BIT_LENGTH];
     this.excludedPinnedBlocks = excludedPinnedBlocks;
     this.nnc = nnc;
-    this.metrics = MoverMetrics.create(this);

Review Comment:
   If the main issue is the potential naming conflict caused by multiple Mover 
instances, can we track the existing MoverMetrics instances and their NNC 
mappings at the class level (i.e. through a static field) to avoid the 
duplication?
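
   For illustration only, a minimal sketch of what such class-level tracking 
could look like. It is not the actual Hadoop code: `SketchMoverMetrics` and 
`MoverMetricsRegistry` are hypothetical stand-ins, keying the map by block pool 
ID is an assumption, and only the `Mover-<BlockpoolID>` source name is taken 
from the log in the issue description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for MoverMetrics; the real class registers with the metrics system. */
final class SketchMoverMetrics {
  final String sourceName;

  SketchMoverMetrics(String sourceName) {
    this.sourceName = sourceName;
  }
}

/** Hypothetical class-level registry: at most one metrics instance per block pool ID. */
final class MoverMetricsRegistry {
  // Static map shared by all Mover instances in the process (assumed key: block pool ID).
  private static final Map<String, SketchMoverMetrics> INSTANCES =
      new ConcurrentHashMap<>();

  private MoverMetricsRegistry() {
  }

  /** Returns the existing instance for this block pool, creating it only on first use. */
  static SketchMoverMetrics getOrCreate(String blockpoolId) {
    return INSTANCES.computeIfAbsent(
        blockpoolId, id -> new SketchMoverMetrics("Mover-" + id));
  }

  /** Drops the mapping, e.g. when the connector is closed, so a later run can register again. */
  static void remove(String blockpoolId) {
    INSTANCES.remove(blockpoolId);
  }
}
```

   With something along these lines, a second `new Mover(...)` for the same 
connector would get the already-created instance back instead of registering 
the same source name again. A real change would also need to unregister the 
source from the metrics system when the mapping is removed.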



##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/mover/Mover.java:
##########
@@ -160,7 +160,7 @@ Collections.<String> emptySet(), movedWinWidth, moverThreads, 0,
         BlockStoragePolicySuite.ID_BIT_LENGTH];
     this.excludedPinnedBlocks = excludedPinnedBlocks;
     this.nnc = nnc;
-    this.metrics = MoverMetrics.create(this);
+    this.metrics = nnc.getMoverMetrics();

Review Comment:
   We also need to add some unit tests that reproduce the issue (without your 
fix) and validate the fix.
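
   As a rough idea of the shape such a test could take, here is a 
self-contained sketch. It does not use the real Hadoop classes: 
`FakeMetricsSystem` and `DuplicateSourceRepro` are hypothetical, and the fake 
only mimics the "already exists" failure seen in the reported stack trace. An 
actual unit test would drive Mover against a mini cluster instead.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a metrics system that rejects duplicate source names. */
final class FakeMetricsSystem {
  private final Set<String> sources = new HashSet<>();

  void register(String name) {
    if (!sources.add(name)) {
      throw new IllegalStateException("Metrics source " + name + " already exists!");
    }
  }
}

final class DuplicateSourceRepro {
  private DuplicateSourceRepro() {
  }

  /**
   * Mimics the reported path: each loop iteration builds a new per-connector
   * object and registers the same source name again.
   * Returns true if every registration succeeded.
   */
  static boolean registerPerIteration(FakeMetricsSystem ms, String name, int iterations) {
    try {
      for (int i = 0; i < iterations; i++) {
        ms.register(name);
      }
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  /**
   * Mimics the proposed direction: register once per connector, then reuse the
   * same registration across iterations.
   */
  static boolean registerOnce(FakeMetricsSystem ms, String name, int iterations) {
    try {
      ms.register(name);
      // Later iterations reuse the existing registration, so nothing is re-registered.
      return iterations >= 0;
    } catch (IllegalStateException e) {
      return false;
    }
  }
}
```

   The first helper is expected to fail from the second iteration onward, which 
corresponds to "reproduce the issue"; the second is expected to succeed for any 
number of iterations, which corresponds to "validate the fix".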





> Exiting Mover due to an exception in MoverMetrics.create
> --------------------------------------------------------
>
>                 Key: HDFS-16867
>                 URL: https://issues.apache.org/jira/browse/HDFS-16867
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZhiWei Shi
>            Assignee: ZhiWei Shi
>            Priority: Major
>              Labels: pull-request-available
>
> After the Mover process has been running for a while, it exits unexpectedly 
> and an error is reported in the log:
> {code:java}
> [hdfs@${hostname} hadoop-3.3.2-nn]$ nohup bin/hdfs mover -p /test-mover-jira9534 > mover.log.jira9534.20221209.2 &
> [hdfs@${hostname} hadoop-3.3.2-nn]$ tail -f mover.log.jira9534.20221209.2
> ...
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Start moving blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to ${ip1}:800:ARCHIVE through ${ip2}:800
> 22/12/09 14:22:32 INFO balancer.Dispatcher: Successfully moved blk_1073911285_170466 with size=134217728 from 10.108.182.205:800:DISK to ${ip1}:800:ARCHIVE through ${ip2}:800
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Stopping Mover metrics system...
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system stopped.
> 22/12/09 14:22:42 INFO impl.MetricsSystemImpl: Mover metrics system shutdown complete.
> Dec 9, 2022, 2:22:42 PM  Mover took 13mins, 19sec
> 22/12/09 14:22:42 ERROR mover.Mover: Exiting Mover due to an exception
> org.apache.hadoop.metrics2.MetricsException: Metrics source Mover-${BlockpoolID} already exists!
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.newSourceName(DefaultMetricsSystem.java:152)
>         at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.sourceName(DefaultMetricsSystem.java:125)
>         at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.register(MetricsSystemImpl.java:229)
>         at org.apache.hadoop.hdfs.server.mover.MoverMetrics.create(MoverMetrics.java:49)
>         at org.apache.hadoop.hdfs.server.mover.Mover.<init>(Mover.java:162)
>         at org.apache.hadoop.hdfs.server.mover.Mover.run(Mover.java:684)
>         at org.apache.hadoop.hdfs.server.mover.Mover$Cli.run(Mover.java:826)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
>         at org.apache.hadoop.hdfs.server.mover.Mover.main(Mover.java:908)
> {code}
> 1. "final ExitStatus r = m.run()" returns as soon as one replica move has 
> been scheduled.
> 2. Because "r == ExitStatus.IN_PROGRESS", iter.remove() is not executed, so 
> the nnc stays in the connectors list.
> 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" are therefore 
> executed multiple times for the same nnc, which leads to the error.
> {code:java}
> //Mover.java
> for (final StorageType t : diff.existing) {
>   for (final MLocation ml : locations) {
>     final Source source = storages.getSource(ml);
>     if (ml.storageType == t && source != null) {
>       // try to schedule one replica move.
>       // 1. returns as soon as one replica move has been scheduled
>       if (scheduleMoveReplica(db, source, diff.expected)) {
>         return true;
>       }
>     }
>   }
> }
> while (connectors.size() > 0) {
>   Collections.shuffle(connectors);
>   Iterator<NameNodeConnector> iter = connectors.iterator();
>   while (iter.hasNext()) {
>     NameNodeConnector nnc = iter.next();
>     // 3. "new Mover" and "this.metrics = MoverMetrics.create(this)" run
>     // multiple times for the same nnc, which leads to the error
>     final Mover m = new Mover(nnc, conf, retryCount,
>         excludedPinnedBlocks);
>     final ExitStatus r = m.run();
>     // 2. when r == ExitStatus.IN_PROGRESS, iter.remove() is not run
>     if (r == ExitStatus.SUCCESS) {
>       IOUtils.cleanupWithLogger(LOG, nnc);
>       iter.remove();
>     }
> {code}
> A possible fix is to initialize MoverMetrics when the nnc is initialized.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
