[jira] [Work logged] (HDFS-16402) HeartbeatManager may cause incorrect stats
[ https://issues.apache.org/jira/browse/HDFS-16402?focusedWorklogId=705782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705782 ]

ASF GitHub Bot logged work on HDFS-16402:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Jan/22 07:32
Start Date: 09/Jan/22 07:32
Worklog Time Spent: 10m

Work Description: Hexiaoqiao commented on pull request #3839:
URL: https://github.com/apache/hadoop/pull/3839#issuecomment-1008245403

It's a good catch. Would you mind adding a new test to cover this case? BTW, what is the root cause of the NPE here? Thanks.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------
Worklog Id: (was: 705782)
Time Spent: 0.5h (was: 20m)

> HeartbeatManager may cause incorrect stats
> ------------------------------------------
>
> Key: HDFS-16402
> URL: https://issues.apache.org/jira/browse/HDFS-16402
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: tomscut
> Assignee: tomscut
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2021-12-29-08-25-44-303.png, image-2021-12-29-08-25-54-441.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> After reconfiguring {*}dfs.datanode.data.dir{*}, we found that the stats on the NameNode web UI became *negative* and there were many NPEs in the NameNode logs. That problem was solved by HDFS-14042.
> !image-2021-12-29-08-25-54-441.png|width=681,height=293!
> !image-2021-12-29-08-25-44-303.png|width=677,height=180!
> However, if *HeartbeatManager#updateHeartbeat* or *HeartbeatManager#updateLifeline* throws any other exception, the stats can still become incorrect. We should ensure that *stats.subtract()* and *stats.add()* are transactional.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
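The transactional subtract/add pairing that HDFS-16402 above calls for can be sketched as follows. This is a minimal illustration only, under the assumption that an exception between the paired calls is what leaves the aggregate negative; `Stats`, `updateHeartbeat`, and the capacity field are simplified stand-ins, not the actual HDFS classes.

```java
// Hypothetical sketch: keep stats.subtract()/stats.add() paired even when the
// update between them throws, so aggregate stats never go negative.
class StatsGuardExample {
    static class Stats {
        long capacity;
        void add(long c)      { capacity += c; }
        void subtract(long c) { capacity -= c; }
    }

    static Stats stats = new Stats();

    /** Update that may throw midway, as the report says updateHeartbeat can. */
    static void updateHeartbeat(long oldCap, long newCap, boolean fail) {
        stats.subtract(oldCap);
        try {
            if (fail) {
                throw new IllegalStateException("update failed");
            }
            stats.add(newCap);
        } catch (RuntimeException e) {
            // Roll back the subtract so the pair stays transactional.
            stats.add(oldCap);
            throw e;
        }
    }

    public static void main(String[] args) {
        stats.add(100);
        try {
            updateHeartbeat(100, 120, true);
        } catch (RuntimeException ignored) {
            // The failed update left the stats unchanged, not negative.
        }
        System.out.println(stats.capacity); // prints 100
    }
}
```

Without the rollback in the catch block, the failed update would leave `capacity` at 0 even though the datanode still reports 100, which is exactly the drift the issue describes.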
[jira] [Created] (HDFS-16418) task
sundas khann created HDFS-16418:
-----------------------------------
Summary: task
Key: HDFS-16418
URL: https://issues.apache.org/jira/browse/HDFS-16418
Project: Hadoop HDFS
Issue Type: Task
Reporter: sundas khann
[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename across sub clusters
[ https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471096#comment-17471096 ]

Íñigo Goiri commented on HDFS-15923:
------------------------------------
Not fully related, but we have seen failures in:
* hadoop.hdfs.server.federation.router.TestRouterFederationRename
* hadoop.hdfs.rbfbalance.TestRouterDistCpProcedure
in PR https://github.com/apache/hadoop/pull/3871. Any idea what the regression might be?

> RBF: Authentication failed when rename across sub clusters
> ----------------------------------------------------------
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: rbf
> Reporter: zhuobin zheng
> Assignee: zhuobin zheng
> Priority: Major
> Labels: RBF, pull-request-available, rename
> Fix For: 3.4.0
>
> Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, HDFS-15923.003.patch, HDFS-15923.stack-trace, hdfs-15923-fix-security-issue.patch
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter the following two errors:
> # Save Object to journal.
> # Precheck tries to get the src file status.
> So we need to use the Router login UGI's doAs to create the DistcpProcedure and TrashProcedure and submit the job.
>
> Besides, we should check user permissions for the src and dst paths on the router side before doing the rename internally (HDFS-15973).
>
> First: Save Object to journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>         at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>         at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
>         at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
>         at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
>         at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
>         at org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1452)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1405)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy11.create(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy12.create(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
>         at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
>         at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
>         at
[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
[ https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705723&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705723 ]

ASF GitHub Bot logged work on HDFS-16417:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:37
Start Date: 08/Jan/22 10:37
Worklog Time Spent: 10m

Work Description: goiri commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007952326

@ferhui any idea why those tests fail?

Issue Time Tracking
-------------------
Worklog Id: (was: 705723)
Time Spent: 1h 10m (was: 1h)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if
> concurrent ns handler count is configured
> -----------------------------------------------------------------------------
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Reporter: Felix N
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, unassignedNS ends up empty and {{handlerCount % unassignedNS.size()}} throws a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is configured.
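The failure mode HDFS-16417 describes, and the guard the patch proposes, can be sketched as follows. This is a simplified stand-in, not the actual StaticRouterRpcFairnessPolicyController code; `leftoverPerNs` and `concurrentExtra` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: distributing leftover handlers with
// leftover % unassignedNs.size() divides by zero once every nameservice has
// an explicit handler count configured. Routing the leftovers to the
// concurrent pool instead, as the fix describes, avoids the exception.
class HandlerSplitExample {
    static int concurrentExtra = 0;

    static int leftoverPerNs(int leftover, List<String> unassignedNs) {
        if (unassignedNs.isEmpty()) {
            // All nameservices are explicitly configured: give the leftover
            // handlers to the concurrent pool rather than computing % 0.
            concurrentExtra += leftover;
            return 0;
        }
        return leftover % unassignedNs.size() == 0
            ? leftover / unassignedNs.size()
            : leftover / unassignedNs.size() + 1;
    }

    public static void main(String[] args) {
        List<String> unassigned = new ArrayList<>(); // empty: all configured
        int perNs = leftoverPerNs(3, unassigned);    // would have been 3 % 0
        System.out.println(perNs + " " + concurrentExtra); // prints "0 3"
    }
}
```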
[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously
[ https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705716 ]

ASF GitHub Bot logged work on HDFS-16043:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:34
Start Date: 08/Jan/22 10:34
Worklog Time Spent: 10m

Work Description: zhuxiangyi commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780053962

## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
## @@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }

  /**
   * Periodically deletes the marked block.
   */
  private class MarkedDeleteBlockScrubber implements Runnable {
    private Iterator<BlockInfo> toDeleteIterator = null;
    private boolean isSleep;

    private void remove(long time) {
      if (checkToDeleteIterator()) {
        namesystem.writeLock();
        try {
          while (toDeleteIterator.hasNext()) {
            removeBlock(toDeleteIterator.next());
            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
              isSleep = true;
              break;
            }
          }
        } finally {
          namesystem.writeUnlock();
        }
      }
    }

    private boolean checkToDeleteIterator() {
      return toDeleteIterator != null && toDeleteIterator.hasNext();
    }

    @Override
    public void run() {
      LOG.info("Start MarkedDeleteBlockScrubber thread");
      while (namesystem.isRunning() &&
          !Thread.currentThread().isInterrupted()) {
        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
          try {
            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
            isSleep = false;
            long startTime = Time.monotonicNow();
            remove(startTime);
            while (!isSleep && !markedDeleteQueue.isEmpty() &&
                !Thread.currentThread().isInterrupted()) {
              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
              if (markedDeleteList != null) {
                toDeleteIterator = markedDeleteList.listIterator();
              }
              remove(startTime);
            }
          } catch (Exception e) {
            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
                " during the block deletion process, " +
                " the deletion of the block will retry in {} millisecond.",
                deleteBlockUnlockIntervalTimeMs, e);
          }
        }
        if (isSleep) {

Review comment: deleteBlockLockTimeMs is configured to 500 ms. If it takes 10 s to delete the blocks in a directory, there will be about 20 logs. Deleting a large directory is not a common case, so I think it is acceptable.
[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously
[ https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705610 ]

ASF GitHub Bot logged work on HDFS-16043:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:23
Start Date: 08/Jan/22 10:23
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#issuecomment-1007206463

Issue Time Tracking
-------------------
Worklog Id: (was: 705610)
Time Spent: 7.5h (was: 7h 20m)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> -------------------------------------------------------------------
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs, namanode
> Affects Versions: 3.4.0
> Reporter: Xiangyi Zhu
> Assignee: Xiangyi Zhu
> Priority: Major
> Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which caused our NameNode to be killed by ZKFC.
> The flame graph shows that the main time-consuming work is the QuotaCount calculation while removing blocks (toRemovedBlocks) and deleting inodes, with removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously. A thread is started in the BlockManager to process the deleted blocks and bound the lock time.
> 2. Optimize the QuotaCount calculation, similar to the optimization in HDFS-16000.
> h3. Comparison before and after optimization:
> Test deleting 10 million inodes and 10 million blocks.
> *before:*
> remove inode elapsed time: 7691 ms
> remove block elapsed time: 11107 ms
> *after:*
> remove inode elapsed time: 4149 ms
> remove block elapsed time: 0 ms
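The hand-off pattern HDFS-16043 describes can be sketched as follows. This is a minimal illustration, not the HDFS implementation: `deleteDirectory`, `drainOnce`, and the use of block IDs as plain `Long`s are simplifying assumptions, and the real scrubber additionally bounds how long each drain holds the namesystem lock.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: the delete path only enqueues block lists, and a
// scrubber thread drains the queue later, so the caller's critical section
// stays short regardless of how many blocks the directory contained.
class AsyncDeleteSketch {
    static final ConcurrentLinkedQueue<List<Long>> markedDeleteQueue =
        new ConcurrentLinkedQueue<>();

    /** Called under the namesystem lock: O(1), just hand off the blocks. */
    static void deleteDirectory(List<Long> blocksOfDeletedInodes) {
        markedDeleteQueue.offer(blocksOfDeletedInodes);
    }

    /** Scrubber thread body: remove blocks outside the delete call. */
    static long drainOnce() {
        long removed = 0;
        List<Long> batch;
        while ((batch = markedDeleteQueue.poll()) != null) {
            removed += batch.size(); // real code would removeBlock() each one
        }
        return removed;
    }

    public static void main(String[] args) {
        deleteDirectory(List.of(1L, 2L, 3L));
        deleteDirectory(List.of(4L, 5L));
        System.out.println(drainOnce()); // prints 5
    }
}
```

This shape explains the "remove block elapsed time: 0 ms" figure above: after the change, the delete call itself only enqueues, and the actual block removal cost moves to the background thread.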
[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously
[ https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705489 ]

ASF GitHub Bot logged work on HDFS-16043:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:12
Start Date: 08/Jan/22 10:12
Worklog Time Spent: 10m

Work Description: base111 commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780053528

## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
## @@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }

  /**
   * Periodically deletes the marked block.
   */
  private class MarkedDeleteBlockScrubber implements Runnable {
    private Iterator<BlockInfo> toDeleteIterator = null;
    private boolean isSleep;

    private void remove(long time) {
      if (checkToDeleteIterator()) {
        namesystem.writeLock();
        try {
          while (toDeleteIterator.hasNext()) {
            removeBlock(toDeleteIterator.next());
            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
              isSleep = true;
              break;
            }
          }
        } finally {
          namesystem.writeUnlock();
        }
      }
    }

    private boolean checkToDeleteIterator() {
      return toDeleteIterator != null && toDeleteIterator.hasNext();
    }

    @Override
    public void run() {
      LOG.info("Start MarkedDeleteBlockScrubber thread");
      while (namesystem.isRunning() &&
          !Thread.currentThread().isInterrupted()) {
        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
          try {
            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
            isSleep = false;
            long startTime = Time.monotonicNow();
            remove(startTime);
            while (!isSleep && !markedDeleteQueue.isEmpty() &&
                !Thread.currentThread().isInterrupted()) {
              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
              if (markedDeleteList != null) {
                toDeleteIterator = markedDeleteList.listIterator();
              }
              remove(startTime);
            }
          } catch (Exception e) {
            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
                " during the block deletion process, " +
                " the deletion of the block will retry in {} millisecond.",
                deleteBlockUnlockIntervalTimeMs, e);
          }
        }
        if (isSleep) {

Review comment: deleteBlockLockTimeMs is configured to 500 ms. If it takes 10 s to delete the blocks in a directory, there will be about 20 logs. Deleting a large directory is not a common case, so I think it is acceptable.

Issue Time Tracking
-------------------
Worklog Id: (was: 705489)
Time Spent: 7h 20m (was: 7h 10m)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> -------------------------------------------------------------------
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs, namanode
> Affects Versions: 3.4.0
> Reporter: Xiangyi Zhu
> Assignee: Xiangyi Zhu
> Priority: Major
> Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which caused our NameNode to be killed by ZKFC.
> The flame graph shows that the main time-consuming work is the QuotaCount calculation while removing blocks (toRemovedBlocks) and deleting inodes, with removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously. A thread is started in the BlockManager to process the deleted blocks and bound the lock time.
> 2. Optimize the QuotaCount calculation, similar to the optimization in HDFS-16000.
> h3. Comparison before and after optimization:
> Test deleting 10 million inodes and 10 million blocks.
> *before:*
> remove inode elapsed time: 7691 ms
> remove block elapsed time: 11107 ms
>
[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
[ https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705485&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705485 ]

ASF GitHub Bot logged work on HDFS-16417:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:12
Start Date: 08/Jan/22 10:12
Worklog Time Spent: 10m

Work Description: kokonguyen191 commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007345085

Not sure why these two timed out; both pass normally when I run them locally.

Issue Time Tracking
-------------------
Worklog Id: (was: 705485)
Time Spent: 1h (was: 50m)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if
> concurrent ns handler count is configured
> -----------------------------------------------------------------------------
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Reporter: Felix N
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, unassignedNS ends up empty and {{handlerCount % unassignedNS.size()}} throws a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is configured.
[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously
[ https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705443 ]

ASF GitHub Bot logged work on HDFS-16043:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:07
Start Date: 08/Jan/22 10:07
Worklog Time Spent: 10m

Work Description: tomscut commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780055220

## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
## @@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }

  /**
   * Periodically deletes the marked block.
   */
  private class MarkedDeleteBlockScrubber implements Runnable {
    private Iterator<BlockInfo> toDeleteIterator = null;
    private boolean isSleep;

    private void remove(long time) {
      if (checkToDeleteIterator()) {
        namesystem.writeLock();
        try {
          while (toDeleteIterator.hasNext()) {
            removeBlock(toDeleteIterator.next());
            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
              isSleep = true;
              break;
            }
          }
        } finally {
          namesystem.writeUnlock();
        }
      }
    }

    private boolean checkToDeleteIterator() {
      return toDeleteIterator != null && toDeleteIterator.hasNext();
    }

    @Override
    public void run() {
      LOG.info("Start MarkedDeleteBlockScrubber thread");
      while (namesystem.isRunning() &&
          !Thread.currentThread().isInterrupted()) {
        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
          try {
            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
            isSleep = false;
            long startTime = Time.monotonicNow();
            remove(startTime);
            while (!isSleep && !markedDeleteQueue.isEmpty() &&
                !Thread.currentThread().isInterrupted()) {
              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
              if (markedDeleteList != null) {
                toDeleteIterator = markedDeleteList.listIterator();
              }
              remove(startTime);
            }
          } catch (Exception e) {
            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
                " during the block deletion process, " +
                " the deletion of the block will retry in {} millisecond.",
                deleteBlockUnlockIntervalTimeMs, e);
          }
        }
        if (isSleep) {

Review comment: At present, there are already warn logs when lock holding exceeds the threshold, and the same logs would be generated here. Can we change the log level to DEBUG?

Issue Time Tracking
-------------------
Worklog Id: (was: 705443)
Time Spent: 7h 10m (was: 7h)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> -------------------------------------------------------------------
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs, namanode
> Affects Versions: 3.4.0
> Reporter: Xiangyi Zhu
> Assignee: Xiangyi Zhu
> Priority: Major
> Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which caused our NameNode to be killed by ZKFC.
> The flame graph shows that the main time-consuming work is the QuotaCount calculation while removing blocks (toRemovedBlocks) and deleting inodes, with removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously. A thread is started in the BlockManager to process the deleted blocks and bound the lock time.
> 2. Optimize the QuotaCount calculation, similar to the optimization in HDFS-16000.
> h3. Comparison before and after optimization:
> Test deleting 10 million inodes and 10 million blocks.
> *before:*
> remove inode elapsed time: 7691 ms
> remove block
[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured
[ https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705419&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705419 ]

ASF GitHub Bot logged work on HDFS-16417:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 08/Jan/22 10:04
Start Date: 08/Jan/22 10:04
Worklog Time Spent: 10m

Work Description: goiri commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007276928

There seem to be failures in:
hadoop.hdfs.server.federation.router.TestRouterFederationRename
hadoop.hdfs.rbfbalance.TestRouterDistCpProcedure
Can you take a look?

Issue Time Tracking
-------------------
Worklog Id: (was: 705419)
Time Spent: 50m (was: 40m)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if
> concurrent ns handler count is configured
> -----------------------------------------------------------------------------
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: rbf
> Reporter: Felix N
> Priority: Minor
> Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, unassignedNS ends up empty and {{handlerCount % unassignedNS.size()}} throws a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is configured.