[jira] [Work logged] (HDFS-16402) HeartbeatManager may cause incorrect stats

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16402?focusedWorklogId=705782&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705782
 ]

ASF GitHub Bot logged work on HDFS-16402:
-

Author: ASF GitHub Bot
Created on: 09/Jan/22 07:32
Start Date: 09/Jan/22 07:32
Worklog Time Spent: 10m 
  Work Description: Hexiaoqiao commented on pull request #3839:
URL: https://github.com/apache/hadoop/pull/3839#issuecomment-1008245403


   It's a good catch. Would you mind adding a new test to cover this case?
   BTW, what is the root cause of the NPE here? Thanks.




Issue Time Tracking
---

Worklog Id: (was: 705782)
Time Spent: 0.5h  (was: 20m)

> HeartbeatManager may cause incorrect stats
> --
>
> Key: HDFS-16402
> URL: https://issues.apache.org/jira/browse/HDFS-16402
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-12-29-08-25-44-303.png, 
> image-2021-12-29-08-25-54-441.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> After reconfiguring *dfs.datanode.data.dir*, we found that the stats on the 
> NameNode web UI became *negative* and there were many NPEs in the NameNode 
> logs. This problem has been solved by HDFS-14042.
> !image-2021-12-29-08-25-54-441.png|width=681,height=293!
> !image-2021-12-29-08-25-44-303.png|width=677,height=180!
> However, if *HeartbeatManager#updateHeartbeat* or 
> *HeartbeatManager#updateLifeline* throws some other exception, stats errors can 
> also occur. We should ensure that *stats.subtract()* and *stats.add()* are 
> transactional.
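
A minimal sketch of the pairing the description calls for, using hypothetical Stats/Node types rather than the actual HeartbeatManager code: subtract() and add() are kept together in try/finally so an exception thrown by the update in between cannot leave the aggregate counters unbalanced (and the web UI stats negative).

{code:java}
// Hedged illustration with hypothetical Stats/Node types (not the actual
// HeartbeatManager code): subtract() and add() are paired in try/finally so an
// exception thrown by the update in between cannot unbalance the totals.
public class StatsGuardExample {
  static class Node {
    long capacity;
    void update(long reportedCapacity) {
      if (reportedCapacity < 0) {
        throw new IllegalStateException("bad report");  // stands in for the NPE
      }
      capacity = reportedCapacity;
    }
  }

  static class Stats {
    long totalCapacity;
    void subtract(Node n) { totalCapacity -= n.capacity; }
    void add(Node n)      { totalCapacity += n.capacity; }
  }

  static void updateHeartbeat(Stats stats, Node node, long reportedCapacity) {
    stats.subtract(node);            // remove the node's old contribution
    try {
      node.update(reportedCapacity); // may throw
    } finally {
      stats.add(node);               // always re-add: the pair stays "transactional"
    }
  }
}
{code}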






[jira] [Created] (HDFS-16418) task

2022-01-08 Thread sundas khann (Jira)
sundas khann created HDFS-16418:
---

 Summary: task
 Key: HDFS-16418
 URL: https://issues.apache.org/jira/browse/HDFS-16418
 Project: Hadoop HDFS
  Issue Type: Task
Reporter: sundas khann









[jira] [Commented] (HDFS-15923) RBF: Authentication failed when rename across sub clusters

2022-01-08 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471096#comment-17471096
 ] 

Íñigo Goiri commented on HDFS-15923:


Not fully related, but we have seen failures in:
* hadoop.hdfs.server.federation.router.TestRouterFederationRename
* hadoop.hdfs.rbfbalance.TestRouterDistCpProcedure

In PR https://github.com/apache/hadoop/pull/3871, any idea on what the 
regression might be?

> RBF: Authentication failed when rename across sub clusters
> 
>
> Key: HDFS-15923
> URL: https://issues.apache.org/jira/browse/HDFS-15923
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Major
>  Labels: RBF, pull-request-available, rename
> Fix For: 3.4.0
>
> Attachments: HDFS-15923.001.patch, HDFS-15923.002.patch, 
> HDFS-15923.003.patch, HDFS-15923.stack-trace, 
> hdfs-15923-fix-security-issue.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Renaming across subclusters in an RBF and Kerberos environment will encounter 
> the following two errors:
>  # Saving the object to the journal.
>  # The precheck that tries to get the src file status.
> So, we need to use the Router login UGI's doAs to create the DistcpProcedure 
> and TrashProcedure and to submit the job.
>  
> Besides, we should check the user's permissions for the src and dst paths on 
> the router side before doing the internal rename (HDFS-15973).
> First: saving the object to the journal.
> {code:java}
> // code placeholder
> 2021-03-23 14:01:16,233 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:622)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:413)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:822)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:818)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:818)
> at 
> org.apache.hadoop.ipc.Client$Connection.access$3800(Client.java:413)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1636)
> at org.apache.hadoop.ipc.Client.call(Client.java:1452)
> at org.apache.hadoop.ipc.Client.call(Client.java:1405)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy11.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:376)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy12.create(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:277)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1240)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1219)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1201)
> at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1139)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:533)
> at 
> 
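
The fix sketched in the description above amounts to running the procedure construction and job submission under the router's login UGI rather than the RPC caller's UGI. A hedged illustration with Hadoop's UserGroupInformation API; submitRenameJob() is a hypothetical stand-in for the DistcpProcedure/TrashProcedure creation and submission, not the actual RouterFederationRename code.

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

// Hedged sketch of the described approach: build and submit the rename
// procedures as the router's own login (Kerberos) identity rather than the RPC
// caller's UGI, so the journal write and the src-status precheck have a valid
// TGT. submitRenameJob() is a hypothetical stand-in for the DistcpProcedure /
// TrashProcedure construction and job submission.
public class RouterRenameAsLoginUser {
  public static void renameAcrossSubclusters(String src, String dst)
      throws Exception {
    UserGroupInformation routerLoginUgi = UserGroupInformation.getLoginUser();
    routerLoginUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      submitRenameJob(src, dst);  // runs with the router's own credentials
      return null;
    });
  }

  private static void submitRenameJob(String src, String dst) {
    // placeholder: create the DistcpProcedure/TrashProcedure and submit the job
  }
}
{code}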

[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705723&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705723
 ]

ASF GitHub Bot logged work on HDFS-16417:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:37
Start Date: 08/Jan/22 10:37
Worklog Time Spent: 10m 
  Work Description: goiri commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007952326


   @ferhui any idea on why those tests fail?




Issue Time Tracking
---

Worklog Id: (was: 705723)
Time Spent: 1h 10m  (was: 1h)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if 
> concurrent ns handler count is configured
> ---
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
> unassignedNS is then empty and {{handlerCount % unassignedNS.size()}} will 
> throw a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is 
> configured.
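
A hedged sketch of the behaviour described above (hypothetical names, not the actual StaticRouterRpcFairnessPolicyController code): when every nameservice, including the concurrent one, already has an explicit handler count, unassignedNS is empty, so the leftover handlers go to the concurrent-ns pool instead of being split with a modulo that would divide by zero.

{code:java}
import java.util.Map;
import java.util.Set;

// Hedged sketch, not the actual StaticRouterRpcFairnessPolicyController code:
// if every nameservice already has an explicit handler count (unassignedNs is
// empty), hand the leftover handlers to the concurrent-ns pool instead of
// computing handlerCount % unassignedNs.size(), which would divide by zero.
public class HandlerAssignmentSketch {
  static final String CONCURRENT_NS = "concurrent";

  static void assignLeftover(int leftoverHandlers, Set<String> unassignedNs,
      Map<String, Integer> permitsPerNs) {
    if (unassignedNs.isEmpty()) {
      permitsPerNs.merge(CONCURRENT_NS, leftoverHandlers, Integer::sum);
      return;
    }
    int base = leftoverHandlers / unassignedNs.size();
    int extra = leftoverHandlers % unassignedNs.size();   // safe: size() > 0
    for (String ns : unassignedNs) {
      permitsPerNs.merge(ns, base + (extra-- > 0 ? 1 : 0), Integer::sum);
    }
  }
}
{code}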






[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705716&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705716
 ]

ASF GitHub Bot logged work on HDFS-16043:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:34
Start Date: 08/Jan/22 10:34
Worklog Time Spent: 10m 
  Work Description: zhuxiangyi commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780053962



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
##
@@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }
 
+  /**
+   * Periodically deletes the marked block.
+   */
+  private class MarkedDeleteBlockScrubber implements Runnable {
+    private Iterator<BlockInfo> toDeleteIterator = null;
+    private boolean isSleep;
+
+    private void remove(long time) {
+      if (checkToDeleteIterator()) {
+        namesystem.writeLock();
+        try {
+          while (toDeleteIterator.hasNext()) {
+            removeBlock(toDeleteIterator.next());
+            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
+              isSleep = true;
+              break;
+            }
+          }
+        } finally {
+          namesystem.writeUnlock();
+        }
+      }
+    }
+
+    private boolean checkToDeleteIterator() {
+      return toDeleteIterator != null && toDeleteIterator.hasNext();
+    }
+
+    @Override
+    public void run() {
+      LOG.info("Start MarkedDeleteBlockScrubber thread");
+      while (namesystem.isRunning() &&
+          !Thread.currentThread().isInterrupted()) {
+        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
+          try {
+            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
+            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
+            isSleep = false;
+            long startTime = Time.monotonicNow();
+            remove(startTime);
+            while (!isSleep && !markedDeleteQueue.isEmpty() &&
+                !Thread.currentThread().isInterrupted()) {
+              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
+              if (markedDeleteList != null) {
+                toDeleteIterator = markedDeleteList.listIterator();
+              }
+              remove(startTime);
+            }
+          } catch (Exception e) {
+            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
+                " during the block deletion process, " +
+                " the deletion of the block will retry in {} millisecond.",
+                deleteBlockUnlockIntervalTimeMs, e);
+          }
+        }
+        if (isSleep) {

Review comment:
   deleteBlockLockTimeMs is configured to 500 ms. If it takes 10 s to delete the 
blocks in a directory, there will be about 20 log entries. Deleting a large 
directory is not common, so I think this is acceptable.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
##
@@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }
 
+  /**
+   * Periodically deletes the marked block.
+   */
+  private class MarkedDeleteBlockScrubber implements Runnable {
+    private Iterator<BlockInfo> toDeleteIterator = null;
+    private boolean isSleep;
+
+    private void remove(long time) {
+      if (checkToDeleteIterator()) {
+        namesystem.writeLock();
+        try {
+          while (toDeleteIterator.hasNext()) {
+            removeBlock(toDeleteIterator.next());
+            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
+              isSleep = true;
+              break;
+            }
+          }
+        } finally {
+          namesystem.writeUnlock();
+        }
+      }
+    }
+
+    private boolean checkToDeleteIterator() {
+      return toDeleteIterator != null && toDeleteIterator.hasNext();
+    }
+
+    @Override
+    public void run() {
+      LOG.info("Start MarkedDeleteBlockScrubber thread");
+      while (namesystem.isRunning() &&
+          !Thread.currentThread().isInterrupted()) {
+        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
+          try {
+            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
+            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
+            isSleep = false;
+            long startTime = Time.monotonicNow();
+            remove(startTime);
+            while (!isSleep && !markedDeleteQueue.isEmpty() &&
+                !Thread.currentThread().isInterrupted()) {
+              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
+              if (markedDeleteList != null) {
+                toDeleteIterator = markedDeleteList.listIterator();
+              }
+

[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705610&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705610
 ]

ASF GitHub Bot logged work on HDFS-16043:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:23
Start Date: 08/Jan/22 10:23
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#issuecomment-1007206463








Issue Time Tracking
---

Worklog Id: (was: 705610)
Time Spent: 7.5h  (was: 7h 20m)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> ---
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which 
> caused our NameNode to be killed by ZKFC.
>  The flame graph shows that the main time is spent on the QuotaCount 
> calculation and on removeBlocks(toRemovedBlocks) when deleting inodes, with 
> removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously: start a thread in the BlockManager to 
> process the deleted blocks and control the lock time.
>  2. Optimize the QuotaCount calculation; this is similar to the optimization 
> in HDFS-16000.
> h3. Comparison before and after optimization:
> Test: delete 10 million inodes and 10 million blocks.
>  *before:*
> remove inode elapsed time: 7691 ms
>  remove block elapsed time: 11107 ms
>  *after:*
>  remove inode elapsed time: 4149 ms
>  remove block elapsed time: 0 ms
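
The real change is visible in the review hunks quoted later in this digest; as a compact, hedged illustration of the idea only (a self-contained toy, not BlockManager code), the deletion path merely enqueues the blocks, and a background scrubber drains the queue while never holding the write lock longer than a configured bound.

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hedged, self-contained toy (not BlockManager code): the delete path only
// enqueues block ids; a background scrubber removes them while capping how
// long the write lock is held before it is released and re-acquired.
public class AsyncDeleteSketch {
  private final ConcurrentLinkedQueue<List<Long>> markedDeleteQueue =
      new ConcurrentLinkedQueue<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final long maxLockHoldMs = 500;  // analogous to deleteBlockLockTimeMs

  // Called by the delete operation: no block removal under the lock here.
  public void markForDeletion(List<Long> blockIds) {
    markedDeleteQueue.add(blockIds);
  }

  // Called periodically by the scrubber thread.
  public void scrubOnce() {
    List<Long> batch;
    while ((batch = markedDeleteQueue.poll()) != null) {
      Iterator<Long> it = batch.iterator();
      while (it.hasNext()) {
        long start = System.nanoTime();
        lock.writeLock().lock();
        try {
          // Remove blocks until the batch is done or the time budget is spent.
          while (it.hasNext()
              && (System.nanoTime() - start) / 1_000_000 <= maxLockHoldMs) {
            removeBlock(it.next());
          }
        } finally {
          lock.writeLock().unlock();  // let other operations make progress
        }
      }
    }
  }

  private void removeBlock(Long blockId) {
    // placeholder for the actual block removal work
  }
}
{code}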






[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705489
 ]

ASF GitHub Bot logged work on HDFS-16043:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:12
Start Date: 08/Jan/22 10:12
Worklog Time Spent: 10m 
  Work Description: base111 commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780053528



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
##
@@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }
 
+  /**
+   * Periodically deletes the marked block.
+   */
+  private class MarkedDeleteBlockScrubber implements Runnable {
+    private Iterator<BlockInfo> toDeleteIterator = null;
+    private boolean isSleep;
+
+    private void remove(long time) {
+      if (checkToDeleteIterator()) {
+        namesystem.writeLock();
+        try {
+          while (toDeleteIterator.hasNext()) {
+            removeBlock(toDeleteIterator.next());
+            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
+              isSleep = true;
+              break;
+            }
+          }
+        } finally {
+          namesystem.writeUnlock();
+        }
+      }
+    }
+
+    private boolean checkToDeleteIterator() {
+      return toDeleteIterator != null && toDeleteIterator.hasNext();
+    }
+
+    @Override
+    public void run() {
+      LOG.info("Start MarkedDeleteBlockScrubber thread");
+      while (namesystem.isRunning() &&
+          !Thread.currentThread().isInterrupted()) {
+        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
+          try {
+            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
+            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
+            isSleep = false;
+            long startTime = Time.monotonicNow();
+            remove(startTime);
+            while (!isSleep && !markedDeleteQueue.isEmpty() &&
+                !Thread.currentThread().isInterrupted()) {
+              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
+              if (markedDeleteList != null) {
+                toDeleteIterator = markedDeleteList.listIterator();
+              }
+              remove(startTime);
+            }
+          } catch (Exception e) {
+            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
+                " during the block deletion process, " +
+                " the deletion of the block will retry in {} millisecond.",
+                deleteBlockUnlockIntervalTimeMs, e);
+          }
+        }
+        if (isSleep) {

Review comment:
   deleteBlockLockTimeMs is configured to 500 ms. If it takes 10 s to delete the 
blocks in a directory, there will be about 20 log entries. Deleting a large 
directory is not common, so I think this is acceptable.






Issue Time Tracking
---

Worklog Id: (was: 705489)
Time Spent: 7h 20m  (was: 7h 10m)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> ---
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which 
> caused our NameNode to be killed by ZKFC.
>  The flame graph shows that the main time is spent on the QuotaCount 
> calculation and on removeBlocks(toRemovedBlocks) when deleting inodes, with 
> removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously: start a thread in the BlockManager to 
> process the deleted blocks and control the lock time.
>  2. Optimize the QuotaCount calculation; this is similar to the optimization 
> in HDFS-16000.
> h3. Comparison before and after optimization:
> Test: delete 10 million inodes and 10 million blocks.
>  *before:*
> remove inode elapsed time: 7691 ms
>  remove block elapsed time: 11107 ms
>  

[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705485&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705485
 ]

ASF GitHub Bot logged work on HDFS-16417:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:12
Start Date: 08/Jan/22 10:12
Worklog Time Spent: 10m 
  Work Description: kokonguyen191 commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007345085


   Not sure why these two timed out; both pass normally when I run them locally.




Issue Time Tracking
---

Worklog Id: (was: 705485)
Time Spent: 1h  (was: 50m)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if 
> concurrent ns handler count is configured
> ---
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
> unassignedNS is then empty and {{handlerCount % unassignedNS.size()}} will 
> throw a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is 
> configured.






[jira] [Work logged] (HDFS-16043) Add markedDeleteBlockScrubberThread to delete blocks asynchronously

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16043?focusedWorklogId=705443&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705443
 ]

ASF GitHub Bot logged work on HDFS-16043:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:07
Start Date: 08/Jan/22 10:07
Worklog Time Spent: 10m 
  Work Description: tomscut commented on a change in pull request #3063:
URL: https://github.com/apache/hadoop/pull/3063#discussion_r780055220



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
##
@@ -4909,6 +4932,75 @@ public long getLastRedundancyMonitorTS() {
     return lastRedundancyCycleTS.get();
   }
 
+  /**
+   * Periodically deletes the marked block.
+   */
+  private class MarkedDeleteBlockScrubber implements Runnable {
+    private Iterator<BlockInfo> toDeleteIterator = null;
+    private boolean isSleep;
+
+    private void remove(long time) {
+      if (checkToDeleteIterator()) {
+        namesystem.writeLock();
+        try {
+          while (toDeleteIterator.hasNext()) {
+            removeBlock(toDeleteIterator.next());
+            if (Time.monotonicNow() - time > deleteBlockLockTimeMs) {
+              isSleep = true;
+              break;
+            }
+          }
+        } finally {
+          namesystem.writeUnlock();
+        }
+      }
+    }
+
+    private boolean checkToDeleteIterator() {
+      return toDeleteIterator != null && toDeleteIterator.hasNext();
+    }
+
+    @Override
+    public void run() {
+      LOG.info("Start MarkedDeleteBlockScrubber thread");
+      while (namesystem.isRunning() &&
+          !Thread.currentThread().isInterrupted()) {
+        if (!markedDeleteQueue.isEmpty() || checkToDeleteIterator()) {
+          try {
+            NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
+            metrics.setDeleteBlocksQueued(markedDeleteQueue.size());
+            isSleep = false;
+            long startTime = Time.monotonicNow();
+            remove(startTime);
+            while (!isSleep && !markedDeleteQueue.isEmpty() &&
+                !Thread.currentThread().isInterrupted()) {
+              List<BlockInfo> markedDeleteList = markedDeleteQueue.poll();
+              if (markedDeleteList != null) {
+                toDeleteIterator = markedDeleteList.listIterator();
+              }
+              remove(startTime);
+            }
+          } catch (Exception e) {
+            LOG.warn("MarkedDeleteBlockScrubber encountered an exception" +
+                " during the block deletion process, " +
+                " the deletion of the block will retry in {} millisecond.",
+                deleteBlockUnlockIntervalTimeMs, e);
+          }
+        }
+        if (isSleep) {

Review comment:
   At present, there are already WARN logs when the lock hold time exceeds the 
threshold, and the same logs are generated here. Can we change the log level to 
DEBUG?






Issue Time Tracking
---

Worklog Id: (was: 705443)
Time Spent: 7h 10m  (was: 7h)

> Add markedDeleteBlockScrubberThread to delete blocks asynchronously
> ---
>
> Key: HDFS-16043
> URL: https://issues.apache.org/jira/browse/HDFS-16043
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Affects Versions: 3.4.0
>Reporter: Xiangyi Zhu
>Assignee: Xiangyi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: 20210527-after.svg, 20210527-before.svg
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Deleting a large directory caused the NN to hold the lock for too long, which 
> caused our NameNode to be killed by ZKFC.
>  The flame graph shows that the main time is spent on the QuotaCount 
> calculation and on removeBlocks(toRemovedBlocks) when deleting inodes, with 
> removeBlocks(toRemovedBlocks) taking the larger share of the time.
> h3. Solution:
> 1. Process removeBlocks asynchronously: start a thread in the BlockManager to 
> process the deleted blocks and control the lock time.
>  2. Optimize the QuotaCount calculation; this is similar to the optimization 
> in HDFS-16000.
> h3. Comparison before and after optimization:
> Test: delete 10 million inodes and 10 million blocks.
>  *before:*
> remove inode elapsed time: 7691 ms
>  remove block elapsed time: 11107 ms
>  *after:*
>  remove inode elapsed time: 4149 ms
> remove block 

[jira] [Work logged] (HDFS-16417) RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if concurrent ns handler count is configured

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16417?focusedWorklogId=705419&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-705419
 ]

ASF GitHub Bot logged work on HDFS-16417:
-

Author: ASF GitHub Bot
Created on: 08/Jan/22 10:04
Start Date: 08/Jan/22 10:04
Worklog Time Spent: 10m 
  Work Description: goiri commented on pull request #3871:
URL: https://github.com/apache/hadoop/pull/3871#issuecomment-1007276928


   There seem to be failures in:
   
   hadoop.hdfs.server.federation.router.TestRouterFederationRename
   hadoop.hdfs.rbfbalance.TestRouterDistCpProcedure
   
   Can you take a look?




Issue Time Tracking
---

Worklog Id: (was: 705419)
Time Spent: 50m  (was: 40m)

> RBF: StaticRouterRpcFairnessPolicyController init fails with division by 0 if 
> concurrent ns handler count is configured
> ---
>
> Key: HDFS-16417
> URL: https://issues.apache.org/jira/browse/HDFS-16417
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: Felix N
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16417.001.patch
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> If {{dfs.federation.router.fairness.handler.count.concurrent}} is configured, 
> unassignedNS is then empty and {{handlerCount % unassignedNS.size()}} will 
> throw a division-by-zero exception.
> Changed it to assign the extra handlers to the concurrent ns when it is 
> configured.


