[jira] [Resolved] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takanobu Asanuma resolved HDFS-17432.
-------------------------------------
    Fix Version/s: 3.4.1, 3.5.0
       Resolution: Fixed

> Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17432
>                 URL: https://issues.apache.org/jira/browse/HDFS-17432
>             Project: Hadoop HDFS
>          Issue Type: Test
>            Reporter: Takanobu Asanuma
>            Assignee: Takanobu Asanuma
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.1, 3.5.0
>
> After HDFS-17370, JUnit4 tests stopped running in hadoop-hdfs-rbf. To enable
> both JUnit4 and JUnit5 tests to run, we need to add junit-vintage-engine to
> the hadoop-hdfs-rbf/pom.xml.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
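For context, the kind of dependency addition the issue describes looks roughly like the fragment below. This is an illustrative sketch, not necessarily the exact declaration committed for HDFS-17432 (in the Hadoop build the version is typically inherited from dependency management in a parent pom):

```xml
<!-- Illustrative pom.xml fragment: junit-vintage-engine lets the JUnit 5
     platform discover and run JUnit 4 tests alongside JUnit 5 ones.
     Version omitted on the assumption it is managed by a parent pom. -->
<dependency>
  <groupId>org.junit.vintage</groupId>
  <artifactId>junit-vintage-engine</artifactId>
  <scope>test</scope>
</dependency>
```

Without a vintage engine on the test classpath, surefire running on the JUnit 5 platform silently discovers only JUnit 5 tests, which matches the symptom described above.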
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829384#comment-17829384 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma commented on PR #6639:
URL: https://github.com/apache/hadoop/pull/6639#issuecomment-2011271575

   Thanks again for your review, @dineshchitlangia.
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829383#comment-17829383 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma merged PR #6639:
URL: https://github.com/apache/hadoop/pull/6639
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829381#comment-17829381 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma commented on PR #6639:
URL: https://github.com/apache/hadoop/pull/6639#issuecomment-2011270228

   The failed tests are caused by HDFS-17354; I created HDFS-17435 to address that issue. Since this PR doesn't cause those failures, I'm merging it.
[jira] [Created] (HDFS-17435) Fix TestRouterRpc#testClearStaleNamespacesInRouterStateIdContext() failed
Takanobu Asanuma created HDFS-17435:
---------------------------------------

             Summary: Fix TestRouterRpc#testClearStaleNamespacesInRouterStateIdContext() failed
                 Key: HDFS-17435
                 URL: https://issues.apache.org/jira/browse/HDFS-17435
             Project: Hadoop HDFS
          Issue Type: Test
            Reporter: Takanobu Asanuma

TestRouterRpc and TestRouterRpcMultiDestination are failing with the following error.

{noformat}
[ERROR] testProxyGetBlockKeys  Time elapsed: 0.573 s <<< ERROR!
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: jenkins is not allowed to impersonate jenkins
{noformat}

This is caused by testClearStaleNamespacesInRouterStateIdContext(), which was introduced by HDFS-17354.
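The AuthorizationException above is Hadoop's standard proxy-user check rejecting the impersonation attempt. As background only (not necessarily the fix adopted for HDFS-17435), impersonation is normally permitted for a user through core-site.xml settings like the following; the jenkins user name and wildcard values here are illustrative placeholders:

```xml
<!-- Illustrative core-site.xml fragment; user name and values are placeholders. -->
<property>
  <!-- hosts from which the proxy user may submit impersonated requests -->
  <name>hadoop.proxyuser.jenkins.hosts</name>
  <value>*</value>
</property>
<property>
  <!-- groups whose members the proxy user may impersonate -->
  <name>hadoop.proxyuser.jenkins.groups</name>
  <value>*</value>
</property>
```

In the test context, such settings would typically be applied to the MiniDFSCluster configuration rather than a real core-site.xml.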
[jira] [Resolved] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dinesh Chitlangia resolved HDFS-17433.
--------------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

> metrics sumOfActorCommandQueueLength should only record valid commands
> ----------------------------------------------------------------------
>
>                 Key: HDFS-17433
>                 URL: https://issues.apache.org/jira/browse/HDFS-17433
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.4.0
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.0
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829368#comment-17829368 ]

ASF GitHub Bot commented on HDFS-17433:
---------------------------------------

dineshchitlangia merged PR #6644:
URL: https://github.com/apache/hadoop/pull/6644
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829367#comment-17829367 ]

ASF GitHub Bot commented on HDFS-17433:
---------------------------------------

dineshchitlangia commented on PR #6644:
URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2011163090

   Thanks @hfutatzhanghb for the contribution and @shardulsadavarte for the review.
[jira] [Commented] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829366#comment-17829366 ]

farmmamba commented on HDFS-17434:
----------------------------------

[~qinyuren] Hi, could you please show your createRbw avgTime?

> Selector.select in SocketIOWithTimeout.java has significant overhead
> --------------------------------------------------------------------
>
>                 Key: HDFS-17434
>                 URL: https://issues.apache.org/jira/browse/HDFS-17434
>             Project: Hadoop HDFS
>          Issue Type: Test
>            Reporter: qinyuren
>            Priority: Major
>         Attachments: image-2024-03-20-19-10-13-016.png, image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, image-2024-03-20-19-55-18-378.png
>
> In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges
> from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine
> network card bandwidth is 2Mb/s.
> !image-2024-03-20-19-10-13-016.png|width=662,height=135!
> !image-2024-03-20-19-55-18-378.png!
> By adding log printing, it turns out that the Selector.select function has
> significant overhead.
> !image-2024-03-20-19-22-29-829.png|width=474,height=262!
> !image-2024-03-20-19-24-02-233.png|width=445,height=181!
> I would like to know if this falls within the normal range or how we can
> improve it.
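One way to sanity-check raw `Selector.select()` latency outside of HDFS is a tiny standalone probe like the sketch below (illustrative code, not from the Hadoop tree). With no channels registered, `select(timeout)` should block close to the timeout and return promptly once it elapses; wake-up latencies far beyond the timeout would point at JVM- or OS-level overhead rather than SocketIOWithTimeout itself:

```java
import java.nio.channels.Selector;

public class SelectProbe {
    public static void main(String[] args) throws Exception {
        try (Selector selector = Selector.open()) {
            selector.selectNow(); // warm-up: class loading and selector setup
            for (int i = 0; i < 3; i++) {
                long start = System.nanoTime();
                // No channels registered, so this blocks until the timeout.
                int ready = selector.select(100);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("ready=" + ready + " elapsedMs=" + elapsedMs);
            }
        }
    }
}
```

On a healthy machine the printed elapsedMs values should sit close to 100; much larger values on timed-out selects would be worth investigating at the OS level.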
[jira] [Updated] (HDFS-17416) [FGL] Monitor threads in BlockManager.class support fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17416:
----------------------------------
    Labels: pull-request-available  (was: )

> [FGL] Monitor threads in BlockManager.class support fine-grained lock
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17416
>                 URL: https://issues.apache.org/jira/browse/HDFS-17416
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>
> There are some monitor threads in BlockManager.class.
>
> This ticket is used to make these threads support fine-grained locking:
> * BlockReportProcessingThread
> * MarkedDeleteBlockScrubber
> * RedundancyMonitor
> * Reconstruction Queue Initializer
[jira] [Commented] (HDFS-17416) [FGL] Monitor threads in BlockManager.class support fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829365#comment-17829365 ]

ASF GitHub Bot commented on HDFS-17416:
---------------------------------------

ZanderXu opened a new pull request, #6647:
URL: https://github.com/apache/hadoop/pull/6647

   Threads in BlockManager.class support fine-grained lock.
   - BlockReportProcessingThread
   - MarkedDeleteBlockScrubber
   - RedundancyMonitor
   - Reconstruction Queue Initializer

   Normally, these threads just need BMReadLock or BMWriteLock, but there are some cases that still need FSReadLock and FSWriteLock:
   - UpdateQuota while completing one block
   - GetStoragePolicyId while removing excess replicas
   - GetFullPath while checking if it is a snapshot
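The split the PR describes can be pictured with two independent read-write locks. The sketch below is a hypothetical toy model; names like bmLock/fsLock and the methods are illustrative, not the actual Hadoop FGL API:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FineGrainedLocks {
    // Two separate locks: most BlockManager monitor work only needs bmLock,
    // so it no longer contends with namespace operations guarded by fsLock.
    private final ReentrantReadWriteLock bmLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    int scanRedundancy(int queued) {
        bmLock.readLock().lock(); // block-level read state only
        try {
            return queued; // placeholder for scanning low-redundancy queues
        } finally {
            bmLock.readLock().unlock();
        }
    }

    void completeBlockAndUpdateQuota() {
        // The exceptional cases listed in the PR (e.g. quota updates) still
        // need the namespace lock; acquiring fsLock before bmLock keeps a
        // consistent lock order and avoids deadlock.
        fsLock.writeLock().lock();
        try {
            bmLock.writeLock().lock();
            try {
                // placeholder: complete the block, then update namespace quota
            } finally {
                bmLock.writeLock().unlock();
            }
        } finally {
            fsLock.writeLock().unlock();
        }
    }
}
```

The benefit under this assumption is that the four monitor threads serialize only against each other and block-map mutations, not against every namespace RPC.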
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829364#comment-17829364 ]

ASF GitHub Bot commented on HDFS-17430:
---------------------------------------

haiyang1987 commented on code in PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#discussion_r1533200066

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
##########

@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
         LOG.info("Skipped stale nodes for recovery : "
             + (storages.length - recoveryLocations.size()));
       }
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
     } else {
-      // If too many replicas are stale, then choose all replicas to
+      // If too many replicas are stale, then choose live replicas to
       // participate in block recovery.
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+      recoveryLocations.clear();
+      storageIdx.clear();
+      for (int i = 0; i < storages.length; ++i) {
+        if (storages[i].getDatanodeDescriptor().isAlive()) {
+          recoveryLocations.add(storages[i]);
+          storageIdx.add(i);
+        }
+      }
+      assert recoveryLocations.size() > 0 : "recoveryLocations size should be > 0";

Review Comment:
   Checked the code again: when handleHeartbeat executes getBlockRecoveryCommand, the reporting datanode should be in the live state at that time, so the size of recoveryLocations is at least 1. So maybe we can remove this assert logic.

> RecoveringBlock will skip no live replicas when get block recovery command.
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-17430
>                 URL: https://issues.apache.org/jira/browse/HDFS-17430
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> RecoveringBlock should skip non-live replicas when building the block recovery command.
>
> *Issue:*
> Currently the following scenario may lead to failure in the execution of
> BlockRecoveryWorker by the datanode, resulting in the file not being closed
> for a long time.
> *t1.* The block blk_xxx_xxx has two replicas [dn1, dn2]; the dn1 machine shut down
> and is in dead status, while dn2 is live.
> *t2.* Block recovery occurs.
> related logs:
> {code:java}
> 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx
> {code}
> *t3.* dn2 is chosen for block recovery.
> dn1 is marked as stale (it is in the dead state) at this time, so the
> recoveryLocations size is 1. Under the following logic, both dn1 and dn2
> will then be chosen to participate in block recovery.
> DatanodeManager#getBlockRecoveryCommand:
> {code:java}
> // Skip stale nodes during recovery
> final List<DatanodeStorageInfo> recoveryLocations =
>     new ArrayList<>(storages.length);
> final List<Integer> storageIdx = new ArrayList<>(storages.length);
> for (int i = 0; i < storages.length; ++i) {
>   if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
>     recoveryLocations.add(storages[i]);
>     storageIdx.add(i);
>   }
> }
> ...
> // If we only get 1 replica after eliminating stale nodes, choose all
> // replicas for recovery and let the primary data node handle failures.
> DatanodeInfo[] recoveryInfos;
> if (recoveryLocations.size() > 1) {
>   if (recoveryLocations.size() != storages.length) {
>     LOG.info("Skipped stale nodes for recovery : "
>         + (storages.length - recoveryLocations.size()));
>   }
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
> } else {
>   // If too many replicas are stale, then choose all replicas to
>   // participate in block recovery.
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
> }
> {code}
> {code:java}
> 2024-03-13 21:58:01,425 INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(563)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - BlockRecoveryWorker: NameNode at xxx:8040 calls recoverBlock(BP-xxx:blk_xxx_xxx, targets=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]], newGenerationStamp=28577373754, newBlock=null, isStriped=false)
> {code}
> *t4.* When dn2 executes BlockRecoveryWorker#recover, it will call the
> initReplicaRecovery operation on dn1; however, since the dn1 machine is
> down at this time, it will take a very long time to time out (the default
> number of retries to establish a server connection is 45).
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829363#comment-17829363 ]

ASF GitHub Bot commented on HDFS-17430:
---------------------------------------

haiyang1987 commented on code in PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#discussion_r1533195923

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
##########

@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
         LOG.info("Skipped stale nodes for recovery : "
             + (storages.length - recoveryLocations.size()));
       }
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
     } else {
-      // If too many replicas are stale, then choose all replicas to
+      // If too many replicas are stale, then choose live replicas to
       // participate in block recovery.
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+      recoveryLocations.clear();
+      storageIdx.clear();
+      for (int i = 0; i < storages.length; ++i) {
+        if (storages[i].getDatanodeDescriptor().isAlive()) {

Review Comment:
   Thanks @Hexiaoqiao for your comment. Is your suggestion that only replicas which are both non-stale and live should be chosen to participate in block recovery?
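The selection rule being discussed in this review can be sketched in isolation. This is a toy model that assumes a simple Replica record with stale/alive flags rather than the real DatanodeStorageInfo API: prefer non-stale replicas, and only when at most one remains fall back to the *live* replicas (the proposed fix) instead of all replicas including dead ones (the old behavior):

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryTargets {
    // Hypothetical stand-in for DatanodeStorageInfo + DatanodeDescriptor state.
    record Replica(String node, boolean stale, boolean alive) {}

    static List<Replica> choose(List<Replica> storages) {
        // Pass 1: skip stale nodes, mirroring the existing loop in
        // DatanodeManager#getBlockRecoveryCommand.
        List<Replica> recovery = new ArrayList<>();
        for (Replica r : storages) {
            if (!r.stale) {
                recovery.add(r);
            }
        }
        // Fallback: too many stale replicas. Choose live replicas only,
        // rather than every replica (which could include dead datanodes
        // that make the primary block a long connection-retry timeout).
        if (recovery.size() <= 1) {
            recovery.clear();
            for (Replica r : storages) {
                if (r.alive) {
                    recovery.add(r);
                }
            }
        }
        return recovery;
    }
}
```

In the scenario from the issue (dn1 dead and stale, dn2 live), this rule selects only dn2, so recovery no longer waits on a 45-retry connection timeout to the dead node.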
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829358#comment-17829358 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

haiyang1987 commented on PR #6628:
URL: https://github.com/apache/hadoop/pull/6628#issuecomment-2011103924

   Thanks @ZanderXu for your review and merge~

> Remove Invalid FileSystemECReadStats logic in DFSInputStream
> ------------------------------------------------------------
>
>                 Key: HDFS-17426
>                 URL: https://issues.apache.org/jira/browse/HDFS-17426
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> When reading a replicated file, the following logic is called in the
> _readingWithStrategy_ and _actualGetFromOneDataNode_ logic in
> DFSInputStream.java:
> {code:java}
> if (readStatistics.getBlockType() == BlockType.STRIPED) {
>   dfsClient.updateFileSystemECReadStats(nread);
> }
> {code}
> This is an invalid call there; we can remove it.
[jira] [Resolved] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZanderXu resolved HDFS-17426.
-----------------------------
    Resolution: Fixed
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829353#comment-17829353 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

ZanderXu commented on PR #6628:
URL: https://github.com/apache/hadoop/pull/6628#issuecomment-2011080558

   Merged.
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829352#comment-17829352 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

ZanderXu merged PR #6628:
URL: https://github.com/apache/hadoop/pull/6628
[jira] [Commented] (HDFS-17103) messy file system cleanup in TestNameEditsConfigs
[ https://issues.apache.org/jira/browse/HDFS-17103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829323#comment-17829323 ]

ASF GitHub Bot commented on HDFS-17103:
---------------------------------------

teamconfx commented on PR #6071:
URL: https://github.com/apache/hadoop/pull/6071#issuecomment-2010748510

   Hi @ayushtkn are we able to merge this?

> messy file system cleanup in TestNameEditsConfigs
> -------------------------------------------------
>
>                 Key: HDFS-17103
>                 URL: https://issues.apache.org/jira/browse/HDFS-17103
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
> h2. What happened:
> Got a {{NullPointerException}} without message when running {{TestNameEditsConfigs}}.
> h2. Where's the bug:
> In line 450 of {{TestNameEditsConfigs}}, the test attempts to clean up the file system:
> {noformat}
> ...
> fileSys = cluster.getFileSystem();
> ...
> } finally {
>   fileSys.close();
>   cluster.shutdown();
> }{noformat}
> However, the cleanup would result in a {{NullPointerException}} that covers up the actual exception if the initialization of {{fileSys}} fails or another exception is thrown before the line that initializes {{fileSys}}.
> h2. How to reproduce:
> (1) Set {{dfs.namenode.maintenance.replication.min}} to {{-1155969698}}
> (2) Run test: {{org.apache.hadoop.hdfs.server.namenode.TestNameEditsConfigs#testNameEditsConfigsFailure}}
> h2. Stacktrace:
> {noformat}
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.namenode.TestNameEditsConfigs.testNameEditsConfigsFailure(TestNameEditsConfigs.java:450){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
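A null-safe variant of the cleanup pattern flagged in this report can be sketched as follows. This is an illustrative sketch with a stand-in openFileSystem() rather than the MiniDFSCluster API; the point is that an unguarded close() in the finally block raises an NPE that masks the exception which actually broke the test:

```java
import java.io.Closeable;
import java.io.IOException;

public class SafeCleanup {
    // Hypothetical stand-in for cluster.getFileSystem(); may throw before
    // the local variable is ever assigned.
    static Closeable openFileSystem(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("mini-cluster failed to start");
        }
        return () -> { };
    }

    static void runTest(boolean fail) throws IOException {
        Closeable fileSys = null;
        try {
            fileSys = openFileSystem(fail);
            // ... test body would go here ...
        } finally {
            // Guard the cleanup: if initialization threw, fileSys is still
            // null, and an unguarded fileSys.close() would throw an NPE
            // that hides the original failure.
            if (fileSys != null) {
                fileSys.close();
            }
        }
    }
}
```

With the guard in place, the original IOException propagates intact instead of being replaced by a NullPointerException from the finally block.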
[jira] [Commented] (HDFS-17109) Null Pointer Exception when running TestBlockManager
[ https://issues.apache.org/jira/browse/HDFS-17109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829321#comment-17829321 ]

ASF GitHub Bot commented on HDFS-17109:
---------------------------------------

teamconfx commented on PR #6046:
URL: https://github.com/apache/hadoop/pull/6046#issuecomment-2010741429

   Hi @goiri, is there anything else I can do to get this PR merged?

> Null Pointer Exception when running TestBlockManager
> ----------------------------------------------------
>
>                 Key: HDFS-17109
>                 URL: https://issues.apache.org/jira/browse/HDFS-17109
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
> h2. What happened
> After setting {{dfs.namenode.redundancy.considerLoadByStorageType=true}}, running the test {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} results in a {{NullPointerException}}.
> h2. Where's the bug
> In the class {{BlockPlacementPolicyDefault}}:
> {noformat}
> for (StorageType s : storageTypes) {
>   StorageTypeStats storageTypeStats = storageStats.get(s);
>   numNodes += storageTypeStats.getNodesInService();
>   numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
> }{noformat}
> However, the class does not check whether storageTypeStats is null, causing the NPE.
> h2. How to reproduce
> # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}}
> # Run {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} and the following exception should be observed:
> {noformat}
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
>     at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
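The guard this report implies can be sketched with a toy stats map. The names below are hypothetical stand-ins, not the actual BlockPlacementPolicyDefault code: the idea is to skip storage types that have no stats entry instead of dereferencing a missing one:

```java
import java.util.Map;

public class XceiverAverage {
    enum StorageType { DISK, SSD, ARCHIVE }

    // Hypothetical stand-in for StorageTypeStats.
    record TypeStats(int nodesInService, int xceiverCount) {}

    // Null-guarded aggregation: a storage type that no datanode reports yet
    // has no map entry, and is skipped instead of causing an NPE.
    static double averageXceivers(Map<StorageType, TypeStats> stats,
                                  Iterable<StorageType> requested) {
        int nodes = 0;
        int xceivers = 0;
        for (StorageType t : requested) {
            TypeStats s = stats.get(t);
            if (s == null) {
                continue; // no stats collected for this storage type
            }
            nodes += s.nodesInService();
            xceivers += s.xceiverCount();
        }
        return nodes == 0 ? 0.0 : (double) xceivers / nodes;
    }
}
```

Whether the real fix should skip the type or treat a missing entry as zero counts is a design choice for the PR; both avoid the crash, and this sketch shows the skip variant.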
[jira] [Commented] (HDFS-17099) Null Pointer Exception when stop namesystem in HDFS
[ https://issues.apache.org/jira/browse/HDFS-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829319#comment-17829319 ] ASF GitHub Bot commented on HDFS-17099: --- teamconfx commented on PR #6034: URL: https://github.com/apache/hadoop/pull/6034#issuecomment-2010737838 Hi @ayushtkn @Hexiaoqiao, are we able to merge this PR if it looks good to you? ;) > Null Pointer Exception when stop namesystem in HDFS > --- > > Key: HDFS-17099 > URL: https://issues.apache.org/jira/browse/HDFS-17099 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ConfX >Assignee: ConfX >Priority: Critical > Labels: pull-request-available > Attachments: reproduce.sh > > > h2. What happened: > Got a NullPointerException when stopping the namesystem in HDFS. > h2. Buggy code: > > {code:java} > void stopActiveServices() { > ... > if (dir != null && getFSImage() != null) { > if (getFSImage().editLog != null) { // <--- Check whether editLog is > null > getFSImage().editLog.close(); > } > // Update the fsimage with the last txid that we wrote > // so that the tailer starts from the right spot. > getFSImage().updateLastAppliedTxIdFromWritten(); // <--- BUG: Even if > editLog is null, this line will still be executed and cause a > NullPointerException > } > ... > } public void updateLastAppliedTxIdFromWritten() { > this.lastAppliedTxId = editLog.getLastWrittenTxId(); // <--- This will > cause a NullPointerException if editLog is null > } {code} > h2. 
StackTrace: > > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSImage.updateLastAppliedTxIdFromWritten(FSImage.java:1553) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:1463) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.close(FSNamesystem.java:1815) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:1017) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:248) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:194) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:181) > {code} > h2. How to reproduce: > (1) Set {{dfs.namenode.top.windows.minutes}} to {{{}37914516,32,0{}}}; or set > {{dfs.namenode.top.window.num.buckets}} to {{{}244111242{}}}. > (2) Run test: > {{org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame#testSecondaryNameNodeXFrame}} > h2. What's more: > I'm still investigating how the parameter > {{dfs.namenode.top.windows.minutes}} triggered the buggy code. > > For an easy reproduction, run the reproduce.sh in the attachment. > We are happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
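A minimal fix consistent with the buggy code quoted above is to make updateLastAppliedTxIdFromWritten itself tolerate a null editLog. The sketch below models the FSImage/FSNamesystem interaction with stand-in classes; the names mirror the Hadoop code, but this is an illustrative sketch under that assumption, not the actual patch.

```java
// Minimal model of the quoted FSImage/FSNamesystem interaction.
class EditLog {
    long getLastWrittenTxId() { return 42; }
    void close() { /* no-op for the sketch */ }
}

class FSImage {
    EditLog editLog;      // may legitimately be null (e.g. SecondaryNameNode path)
    long lastAppliedTxId; // defaults to 0

    void updateLastAppliedTxIdFromWritten() {
        if (editLog != null) { // guard that the quoted code is missing
            lastAppliedTxId = editLog.getLastWrittenTxId();
        }
    }
}

public class StopActiveServicesSketch {
    static void stopActiveServices(FSImage image) {
        if (image != null) {
            if (image.editLog != null) {
                image.editLog.close();
            }
            // Safe now even when editLog is null.
            image.updateLastAppliedTxIdFromWritten();
        }
    }

    public static void main(String[] args) {
        FSImage noLog = new FSImage();   // editLog == null: previously threw NPE
        stopActiveServices(noLog);
        FSImage withLog = new FSImage();
        withLog.editLog = new EditLog();
        stopActiveServices(withLog);
        System.out.println(noLog.lastAppliedTxId + " " + withLog.lastAppliedTxId); // 0 42
    }
}
```

An equivalent alternative is to move the updateLastAppliedTxIdFromWritten() call inside the existing editLog null check in stopActiveServices; which variant the maintainers prefer is up to review.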
[jira] [Commented] (HDFS-17098) DatanodeManager does not handle null storage type properly
[ https://issues.apache.org/jira/browse/HDFS-17098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829316#comment-17829316 ] ASF GitHub Bot commented on HDFS-17098: --- teamconfx commented on code in PR #6035: URL: https://github.com/apache/hadoop/pull/6035#discussion_r1532946326 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java: ## @@ -666,7 +666,15 @@ private Consumer> createSecondaryNodeSorter() { Consumer> secondarySort = null; if (readConsiderStorageType) { Comparator comp = - Comparator.comparing(DatanodeInfoWithStorage::getStorageType); + Comparator.comparing(DatanodeInfoWithStorage::getStorageType, (s1, s2) -> { + if (s1 == null) { Review Comment: @Hexiaoqiao we got this when we set the following configuration "dfs.heartbeat.interval=1753310367" and "dfs.namenode.read.considerStorageType=true". Under this config, the test "org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager#testNoLogs" would trigger the case. > DatanodeManager does not handle null storage type properly > -- > > Key: HDFS-17098 > URL: https://issues.apache.org/jira/browse/HDFS-17098 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ConfX >Priority: Critical > Labels: pull-request-available > Attachments: reproduce.sh > > > h2. What happened: > Got a {{NullPointerException}} without message when sorting datanodes in > {{{}NetworkTopology{}}}. > h2. Where's the bug: > In line 654 of {{{}DatanodeManager{}}}, the manager creates a second sorter > using the standard {{Comparator}} class: > {noformat} > Comparator comp = > Comparator.comparing(DatanodeInfoWithStorage::getStorageType); > secondarySort = list -> Collections.sort(list, comp);{noformat} > This comparator is then used in {{NetworkTopology}} as a secondary sort to > break ties: > {noformat} > if (secondarySort != null) { > // a secondary sort breaks the tie between nodes. 
> secondarySort.accept(nodesList); > }{noformat} > However, if the storage type is {{{}null{}}}, a {{NullPointerException}} > would be thrown since the default {{Comparator.comparing}} cannot handle > comparison between null values. > h2. How to reproduce: > (1) Set {{dfs.heartbeat.interval}} to {{{}1753310367{}}}, and > {{dfs.namenode.read.considerStorageType}} to {{true}} > (2) Run test: > {{org.apache.hadoop.hdfs.server.blockmanagement.TestSortLocatedBlock#testAviodStaleAndSlowDatanodes}} > h2. Stacktrace: > {noformat} > java.lang.NullPointerException > at > java.base/java.util.Comparator.lambda$comparing$77a9974f$1(Comparator.java:469) > at java.base/java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.base/java.util.TimSort.sort(TimSort.java:220) > at java.base/java.util.Arrays.sort(Arrays.java:1515) > at java.base/java.util.ArrayList.sort(ArrayList.java:1750) > at java.base/java.util.Collections.sort(Collections.java:179) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.lambda$createSecondaryNodeSorter$0(DatanodeManager.java:654) > at > org.apache.hadoop.net.NetworkTopology.sortByDistance(NetworkTopology.java:983) > at > org.apache.hadoop.net.NetworkTopology.sortByDistanceUsingNetworkLocation(NetworkTopology.java:946) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlock(DatanodeManager.java:637) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:554) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestSortLocatedBlock.testAviodStaleAndSlowDatanodes(TestSortLocatedBlock.java:144){noformat} > For an easy reproduction, run the reproduce.sh in the attachment. We are > happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
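The JDK already ships a null-tolerant combinator for exactly this situation: Comparator.nullsFirst (or nullsLast) can be passed as the key comparator to Comparator.comparing, so the tie-break sort no longer throws when a storage type is null. The sketch below is self-contained; the Node record is a hypothetical stand-in for DatanodeInfoWithStorage, not the actual Hadoop type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NullSafeSecondarySort {
    // Stand-in for DatanodeInfoWithStorage: only the sort key matters here.
    record Node(String name, String storageType) { }

    static void secondarySort(List<Node> nodes) {
        // nullsFirst wraps the key comparator so a null storage type sorts
        // ahead of non-null ones instead of throwing NullPointerException.
        Comparator<Node> comp = Comparator.comparing(
            Node::storageType,
            Comparator.nullsFirst(Comparator.naturalOrder()));
        nodes.sort(comp);
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("dn1", "SSD"));
        nodes.add(new Node("dn2", null));  // plain Comparator.comparing would NPE here
        nodes.add(new Node("dn3", "DISK"));
        secondarySort(nodes);
        nodes.forEach(n -> System.out.println(n.name())); // dn2, dn3, dn1
    }
}
```

Whether null should sort first or last (i.e. whether an unknown storage type should be preferred or deprioritized when breaking ties) is a policy choice for the patch under review.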
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829249#comment-17829249 ] ASF GitHub Bot commented on HDFS-17433: --- hadoop-yetus commented on PR #6644: URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2010083682 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 31s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 7s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 15s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 15s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 57s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 13s | | the patch passed | | +1 :green_heart: | compile | 1m 7s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 7s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 58s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 13s | | the patch passed | | +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 16s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 227m 51s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. 
| | | | 366m 24s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6644 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 9e01e979dce9 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2b7a5f58664d3c4467817b5f8b150e66ab71a6ba | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/testReport/ | | Max. process+thread count | 4141 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetu
[jira] [Commented] (HDFS-17423) [FGL] BlockManagerSafeMode supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828953#comment-17828953 ] ASF GitHub Bot commented on HDFS-17423: --- hadoop-yetus commented on PR #6645: URL: https://github.com/apache/hadoop/pull/6645#issuecomment-2009645872 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 30s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ HDFS-17384 Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 56s | | HDFS-17384 passed | | +1 :green_heart: | compile | 1m 21s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 14s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 12s | | HDFS-17384 passed | | +1 :green_heart: | mvnsite | 1m 24s | | HDFS-17384 passed | | +1 :green_heart: | javadoc | 1m 7s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 41s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 14s | | HDFS-17384 passed | | +1 :green_heart: | shadedclient | 35m 26s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 9s | | the patch passed | | +1 :green_heart: | compile | 1m 13s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 13s | | the patch passed | | +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 8s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 59s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 12s | | the patch passed | | +1 :green_heart: | javadoc | 0m 54s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 15s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 10s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 230m 46s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6645/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. 
| | | | 370m 34s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockManager | | | hadoop.hdfs.server.datanode.TestLargeBlockReport | | | hadoop.hdfs.server.blockmanagement.TestBlockManagerSafeMode | | | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy | | | hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand | | | hadoop.hdfs.protocol.TestBlockListAsLongs | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6645/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6645 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux e66939c4dace 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | HDFS-17384 / 274f1b4e0a04ffc64e02ad3438869e8ebe761026 | | Default Java | Private Bui
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Description: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 2Mb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! !image-2024-03-20-19-55-18-378.png! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. was: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 10Gb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, > image-2024-03-20-19-55-18-378.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 2Mb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > !image-2024-03-20-19-55-18-378.png! 
> By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Issue Type: Wish (was: Task) > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828925#comment-17828925 ] qinyuren commented on HDFS-17434: - [~hexiaoqiao] [~tasanuma] [~zanderxu] Please take a look. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Task >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17129) mis-order of ibr and fbr on datanode
[ https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated HDFS-17129: Attachment: image-2024-03-20-18-07-42-155.png > mis-order of ibr and fbr on datanode > - > > Key: HDFS-17129 > URL: https://issues.apache.org/jira/browse/HDFS-17129 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.4.0, 3.3.9, 3.3.6 > Environment: hdfs3.4.0 >Reporter: liuguanghua >Assignee: liuguanghua >Priority: Blocker > Labels: pull-request-available > Attachments: image-2024-03-20-18-07-42-155.png > > > HDFS-16016 provides a new thread to handle IBRs. That is a great improvement. > But it may cause a mis-order of IBRs and FBRs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Attachment: image-2024-03-20-19-55-18-378.png > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, > image-2024-03-20-19-55-18-378.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Issue Type: Test (was: Wish) > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828907#comment-17828907 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531806620 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { final int targetLength = strategy.getTargetLength(); -int length = 0; -try { - while (length < targetLength) { -int ret = strategy.readFromBlock(blockReader); -if (ret < 0) { - throw new IOException("Unexpected EOS from the reader"); +int curAttempts = 0; +while (curAttempts < readDNMaxAttempts) { + curAttempts++; + int length = 0; + try { +while (length < targetLength) { + int ret = strategy.readFromBlock(blockReader); + if (ret < 0) { +throw new IOException("Unexpected EOS from the reader"); + } + length += ret; +} +return length; + } catch (ChecksumException ce) { +DFSClient.LOG.warn("Found Checksum error for " ++ currentBlock + " from " + currentNode ++ " at " + ce.getPos()); +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +// we want to remember which block replicas we have tried +corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode); +throw ce; + } catch (IOException e) { +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +if (curAttempts < readDNMaxAttempts) { + if (readerInfos[chunkIndex].reader != null) { +readerInfos[chunkIndex].reader.close(); + } + if (dfsStripedInputStream.createBlockReader(currentBlock, + alignedStripe.getOffsetInBlock(), 
targetBlocks, Review Comment: Hi @Neilxzn @Hexiaoqiao @ayushtkn @zhangshuyan0 @ZanderXu what dou you think? Please also help to look into this issue when you have free time , thanks~ > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } 
> count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + "
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828905#comment-17828905 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531799282

## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:
@@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) {
   private int readToBuffer(BlockReader blockReader,
       DatanodeInfo currentNode, ByteBufferStrategy strategy,
-      ExtendedBlock currentBlock) throws IOException {
+      LocatedBlock currentBlock, int chunkIndex) throws IOException {
     final int targetLength = strategy.getTargetLength();
-    int length = 0;
-    try {
-      while (length < targetLength) {
-        int ret = strategy.readFromBlock(blockReader);
-        if (ret < 0) {
-          throw new IOException("Unexpected EOS from the reader");
+    int curAttempts = 0;
+    while (curAttempts < readDNMaxAttempts) {
+      curAttempts++;
+      int length = 0;
+      try {
+        while (length < targetLength) {
+          int ret = strategy.readFromBlock(blockReader);
+          if (ret < 0) {
+            throw new IOException("Unexpected EOS from the reader");
+          }
+          length += ret;
+        }
+        return length;
+      } catch (ChecksumException ce) {
+        DFSClient.LOG.warn("Found Checksum error for "
+            + currentBlock + " from " + currentNode
+            + " at " + ce.getPos());
+        //Clear buffer to make next decode success
+        strategy.getReadBuffer().clear();
+        // we want to remember which block replicas we have tried
+        corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode);
+        throw ce;
+      } catch (IOException e) {
+        //Clear buffer to make next decode success
+        strategy.getReadBuffer().clear();
+        if (curAttempts < readDNMaxAttempts) {
+          if (readerInfos[chunkIndex].reader != null) {
+            readerInfos[chunkIndex].reader.close();
+          }
+          if (dfsStripedInputStream.createBlockReader(currentBlock,
+              alignedStripe.getOffsetInBlock(),
targetBlocks, Review Comment: ``` if (dfsStripedInputStream.createBlockReader(currentBlock, offsetInBlock, targetBlocks, ``` > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > 
System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > T
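The retry loop introduced by the `readToBuffer` diff above can be reduced to a small self-contained sketch. Names such as `Reader` and `readToLength` are illustrative stand-ins, not the Hadoop StripeReader API; it shows only the bounded-attempt structure the patch adds:

```java
import java.io.IOException;

/**
 * Minimal sketch of the bounded-retry read loop discussed in the review:
 * on an IOException the (hypothetical) reader would be recreated and the
 * read retried, up to maxAttempts. Illustrative only, not Hadoop code.
 */
public class RetryReadSketch {
  interface Reader {
    // returns number of bytes read; < 0 signals end-of-stream
    int read() throws IOException;
  }

  static int readToLength(Reader reader, int targetLength, int maxAttempts)
      throws IOException {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      int length = 0;
      try {
        while (length < targetLength) {
          int ret = reader.read();
          if (ret < 0) {
            throw new IOException("Unexpected EOS from the reader");
          }
          length += ret;
        }
        return length;
      } catch (IOException e) {
        if (attempts >= maxAttempts) {
          throw e; // retries exhausted: surface the failure
        }
        // a real client would close and recreate the block reader here,
        // and (per the review comments) resume from the offset already
        // consumed instead of re-reading from the start of the range
      }
    }
    throw new IOException("retries exhausted");
  }
}
```

The reviewers' point is exactly about the resume position: after a transient failure, `createBlockReader` must be positioned at the actual `offsetInBlock`, otherwise the retried read returns duplicate data.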
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Description: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 10Gb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. was: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Task >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! 
> !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
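The measurement described in HDFS-17434 (adding log printing around `Selector.select`) can be sketched with plain `java.nio`; this is an illustrative instrument, not DataNode code, and the helper name `timedSelect` is an assumption:

```java
import java.io.IOException;
import java.nio.channels.Selector;

/**
 * Sketch of the instrumentation described in HDFS-17434: time how long a
 * single Selector.select(timeout) call actually blocks, so its overhead can
 * be compared against the expected network wait.
 */
public class SelectTimingSketch {
  // returns the observed blocking time of one select call, in nanoseconds
  static long timedSelect(Selector selector, long timeoutMs) throws IOException {
    long start = System.nanoTime();
    selector.select(timeoutMs); // may return early when channels are ready
    return System.nanoTime() - start;
  }
}
```

Logging this duration around the select call in SocketIOWithTimeout is essentially what the screenshots attached to the issue show.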
[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828914#comment-17828914 ] Xiping Zhang commented on HDFS-16016: - HDFS-16016 is a good improvement. In our production environment we have some large DNs with 24 disks and total blocks reaching more than 10 million. As hardware develops, DNs may become even larger, and if the FBR and IBR are coupled together, the impact on the service is significant. HDFS-16016 solves exactly this DN scaling problem. For issue HDFS-17129, I have a solution, which is to redefine the semantics of the FBR. Instead of requiring the DN to align all of its blocks 100% with the NameNode in this FBR, we only need to compare the blocks before the last block of the FBR, even though the FBR missed some blocks from the incremental report. I've drawn a diagram for ease of understanding:
* step1: the NN block-report processing flow before HDFS-16016
* step2: the NN block-report processing flow after HDFS-16016, where the HDFS-17129 problem can occur
* step3: we operate only on the blocks before the last zero bound point of the FBR
* step4: blocks not handled by the previous FBR are processed by the next FBR, unless the DN does not add any new blocks between FBRs
!image-2024-03-20-18-31-23-937.png! [~liuguanghua] [~hexiaoqiao] [~tasanuma] hello, do you have any suggestions on this understanding of FBR and this plan? Using lock restrictions here would be like going back to square one. If we use this solution, we only need to remove the remaining to_remove blocks, which requires removing only one piece of code. 
> BPServiceActor add a new thread to handle IBR > - > > Key: HDFS-16016 > URL: https://issues.apache.org/jira/browse/HDFS-16016 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.6 > > Attachments: image-2023-11-03-18-11-54-502.png, > image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, > image-2024-03-20-18-31-23-937.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. > We can handle IBR independently to improve the performance of heartbeat and > FBR. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
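The "redefined FBR semantics" proposed above — reconciling only the blocks up to the last block the FBR actually covered, and deferring the rest to the next FBR — can be sketched as follows. This is purely illustrative: `blocksToRemove` and the raw long block ids are stand-ins, not the BlockManager API.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Sketch of the "partial FBR" idea: during a full block report, only blocks
 * at or below the highest block id the FBR reported are reconciled; anything
 * beyond that boundary is left for IBRs / the next FBR.
 */
public class PartialFbrSketch {
  static Set<Long> blocksToRemove(Set<Long> namenodeView, Set<Long> fbrBlocks) {
    if (fbrBlocks.isEmpty()) {
      return Set.of();
    }
    long boundary = new TreeSet<>(fbrBlocks).last();
    Set<Long> toRemove = new TreeSet<>();
    for (long b : namenodeView) {
      // only blocks at or below the FBR boundary can safely be judged stale;
      // later blocks may simply have arrived via IBR after the FBR snapshot
      if (b <= boundary && !fbrBlocks.contains(b)) {
        toRemove.add(b);
      }
    }
    return toRemove;
  }
}
```

Blocks with ids above the FBR's last reported block are deferred rather than removed, matching step3/step4 in the diagram.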
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828912#comment-17828912 ] ASF GitHub Bot commented on HDFS-17430: --- Hexiaoqiao commented on code in PR #6635: URL: https://github.com/apache/hadoop/pull/6635#discussion_r1531824148

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
       LOG.info("Skipped stale nodes for recovery : "
           + (storages.length - recoveryLocations.size()));
     }
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
   } else {
-    // If too many replicas are stale, then choose all replicas to
+    // If too many replicas are stale, then choose live replicas to
     // participate in block recovery.
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+    recoveryLocations.clear();
+    storageIdx.clear();
+    for (int i = 0; i < storages.length; ++i) {
+      if (storages[i].getDatanodeDescriptor().isAlive()) {

Review Comment: What about adding this condition to L1736~L1740 together?

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
       LOG.info("Skipped stale nodes for recovery : "
           + (storages.length - recoveryLocations.size()));
     }
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
   } else {
-    // If too many replicas are stale, then choose all replicas to
+    // If too many replicas are stale, then choose live replicas to
     // participate in block recovery.
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+    recoveryLocations.clear();
+    storageIdx.clear();
+    for (int i = 0; i < storages.length; ++i) {
+      if (storages[i].getDatanodeDescriptor().isAlive()) {
+        recoveryLocations.add(storages[i]);
+        storageIdx.add(i);
+      }
+    }
+    assert recoveryLocations.size() > 0 : "recoveryLocations size should be > 0";

Review Comment: Is this assert necessary here, or could `recoveryLocations` have size 0 if all DataNodes are not alive?

> RecoveringBlock will skip no live replicas when get block recovery command. > --- > > Key: HDFS-17430 > URL: https://issues.apache.org/jira/browse/HDFS-17430 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > RecoveringBlock maybe skip no live replicas when get block recovery command. > *Issue:* > Currently the following scenarios may lead to failure in the execution of > BlockRecoveryWorker by the datanode, resulting file being not to be closed > for a long time. > *t1.* The block_xxx_xxx has two replicas[dn1,dn2]; the dn1 machine shut down > and will be dead status, the dn2 is live status. > *t2.* Occurs block recovery. > related logs: > {code:java} > 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* > NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease > recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx > {code} > *t3.* The dn2 is chosen for block recovery. > dn1 is marked as stale (is dead state) at this time, here the > recoveryLocations size is 1, currently according to the following logic, dn1 > and dn2 will be chosen to participate in block recovery. 
> DatanodeManager#getBlockRecoveryCommand > {code:java} >// Skip stale nodes during recovery > final List<DatanodeStorageInfo> recoveryLocations = > new ArrayList<>(storages.length); > final List<Integer> storageIdx = new ArrayList<>(storages.length); > for (int i = 0; i < storages.length; ++i) { >if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) { > recoveryLocations.add(storages[i]); > storageIdx.add(i); >} > } > ... > // If we only get 1 replica after eliminating stale nodes, choose all > // replicas for recovery and let the primary data node handle failures. > DatanodeInfo[] recoveryInfos; > if (recoveryLocations.size() > 1) { >if (recoveryLocations.size() != storages.length) { > LOG.info("Skipped stale nodes for recovery : " > + (storages.length - recoveryLocations.size())); >} >recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations); > } else { >
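The replica selection discussed in HDFS-17430 — skip stale replicas first, and if at most one remains, fall back to live replicas only instead of all replicas — can be sketched independently of the Hadoop types. `Storage` here is a hypothetical stand-in for `DatanodeStorageInfo`, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of choosing block-recovery locations: prefer non-stale replicas,
 * but when too few remain, restrict the fallback to replicas on live
 * DataNodes so dead nodes (like dn1 in the issue) are never chosen.
 */
public class RecoveryLocationSketch {
  record Storage(String id, boolean stale, boolean alive) {}

  static List<Storage> chooseRecoveryLocations(List<Storage> storages) {
    List<Storage> recovery = new ArrayList<>();
    for (Storage s : storages) {
      if (!s.stale()) {
        recovery.add(s); // first pass: skip stale nodes
      }
    }
    if (recovery.size() <= 1) {
      // too many stale replicas: retry with liveness as the only filter
      recovery.clear();
      for (Storage s : storages) {
        if (s.alive()) {
          recovery.add(s);
        }
      }
    }
    return recovery;
  }
}
```

In the t1–t3 scenario above (dn1 stale and dead, dn2 live), this fallback selects only dn2, which is the behavior the patch aims for.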
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828906#comment-17828906 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531803838 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { final int targetLength = strategy.getTargetLength(); -int length = 0; -try { - while (length < targetLength) { -int ret = strategy.readFromBlock(blockReader); -if (ret < 0) { - throw new IOException("Unexpected EOS from the reader"); +int curAttempts = 0; +while (curAttempts < readDNMaxAttempts) { + curAttempts++; + int length = 0; + try { +while (length < targetLength) { + int ret = strategy.readFromBlock(blockReader); + if (ret < 0) { +throw new IOException("Unexpected EOS from the reader"); + } + length += ret; +} +return length; + } catch (ChecksumException ce) { +DFSClient.LOG.warn("Found Checksum error for " ++ currentBlock + " from " + currentNode ++ " at " + ce.getPos()); +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +// we want to remember which block replicas we have tried +corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode); +throw ce; + } catch (IOException e) { +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +if (curAttempts < readDNMaxAttempts) { + if (readerInfos[chunkIndex].reader != null) { +readerInfos[chunkIndex].reader.close(); + } + if (dfsStripedInputStream.createBlockReader(currentBlock, + alignedStripe.getOffsetInBlock(), 
targetBlocks, Review Comment: If pread is used and the buffer size is set to a full block size, then for a single block on a DN the data of multiple cell units may be read, so the ByteBufferStrategy array in the StripingChunk corresponding to the AlignedStripe is computed to have multiple entries (there are multiple List slices in ChunkByteBuffer); see https://github.com/apache/hadoop/assets/3760130/40f7a944-ea57-4891-9719-86a1b009244d. So when retrying createBlockReader in readToBuffer, we may need to consider the current actual offsetInBlock to avoid reading duplicate data from the datanode. > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { >
[jira] [Created] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
qinyuren created HDFS-17434: --- Summary: Selector.select in SocketIOWithTimeout.java has significant overhead Key: HDFS-17434 URL: https://issues.apache.org/jira/browse/HDFS-17434 Project: Hadoop HDFS Issue Type: Task Reporter: qinyuren Attachments: image-2024-03-20-19-10-13-016.png, image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828902#comment-17828902 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531794650

## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:
@@ -284,7 +307,8 @@ private Callable<Void> readCells(final BlockReader reader,
       int ret = 0;
       for (ByteBufferStrategy strategy : strategies) {
-        int bytesReead = readToBuffer(reader, datanode, strategy, currentBlock);
+        int bytesReead = readToBuffer(reader, datanode, strategy, currentBlock,
+            chunkIndex);

Review Comment: For `readToBuffer`, we may need to consider the current actual offsetInBlock: `readToBuffer(reader, datanode, strategy, currentBlock, chunkIndex, ret);`

> DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > Thread.sleep(sleepDuration); > System.out.println("resuming"); > } > if (count == countBeforeSleep + countAfterSleep) { > System.out.println("done"); > break; > } > } > } catch (Exception e) { > System.out.println("exception on read " + count + " read total " > + readTotal); > throw e; > } > } > } > {code} > The issue appears to be due to the fact that datanodes close the connection > of EC client if it doesn't fetch next packet for longer than > dfs.client.socket-timeout. The EC client doesn't retry and instead assumes > that those datanodes went away resulting in "missing blocks" exception. 
> I was able to consistently reproduce with the following arguments: > {noformat} > bufLen = 100 (just below 1MB which is the size of the stripe) > sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000) > countBeforeSleep = 1 > countAfterSleep = 7 > {noformat} > I've attached the entire log output of running the snippet above against > erasure coded file with RS-3-2-1024k policy. And here are the logs from > datanodes of disconnecting the client: > datanode 1: > {noformat} > 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: > /10.128.14.46:9866); java.net.SocketTi
[jira] [Updated] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated HDFS-16016: Attachment: image-2024-03-20-18-31-23-937.png > BPServiceActor add a new thread to handle IBR > - > > Key: HDFS-16016 > URL: https://issues.apache.org/jira/browse/HDFS-16016 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.6 > > Attachments: image-2023-11-03-18-11-54-502.png, > image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, > image-2024-03-20-18-31-23-937.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. > We can handle IBR independently to improve the performance of heartbeat and > FBR. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828904#comment-17828904 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531798611 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { Review Comment: ``` private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, LocatedBlock currentBlock, int chunkIndex, long offsetInBlock) ``` > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > Thread.sleep(sleepDuration); > System.out.println("resuming"); > } > if (count == countBeforeSleep + countAfterSleep) { > System.out.println("done"); > break; > } > } > } catch (Exception e) { > System.out.println("exception on read " + count + " read total " > + readTotal); > throw e; > } > } > } > {code} > The issue appears to be due to the fact that datanodes close the connection > of EC client if it doesn't fetch next packet for longer than > dfs.client.socket-timeout. The EC client doesn't retry and instead assumes > that those datanodes went away resulting in "missing blocks" exception. 
> I was able to consistently reproduce with the following arguments: > {noformat} > bufLen = 100 (just below 1MB which is the size of the stripe) > sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000) > countBeforeSleep = 1 > countAfterSleep = 7 > {noformat} > I've attached the entire log output of running the snippet above against > erasure coded file with RS-3-2-1024k policy. And here are the logs from > datanodes of disconnecting the client: > datanode 1: > {noformat} > 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: > /
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828676#comment-17828676 ] ASF GitHub Bot commented on HDFS-17433: --- hadoop-yetus commented on PR #6644: URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2009083667 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 12m 41s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 39s | | trunk passed | | +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 12s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 27s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 45s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 14s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 10s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 12s | | the patch passed | | +1 :green_heart: | compile | 1m 5s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 5s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply | | +1 :green_heart: | checkstyle | 0m 58s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 17s | | the patch passed | | +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 36s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 54s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 229m 1s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. 
| | | | 381m 9s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6644 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux fdc3d112ff7a 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 31038f1ddc0aa1c3f2c7803bcc47b3418d200be7 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/testReport/ | | Max. process+thread count | 4051 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop
[jira] [Updated] (HDFS-17423) [FGL] BlockManagerSafeMode supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-17423: -- Labels: pull-request-available (was: ) > [FGL] BlockManagerSafeMode supports fine-grained lock > - > > Key: HDFS-17423 > URL: https://issues.apache.org/jira/browse/HDFS-17423 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > [FGL] BlockManagerSafeMode supports fine-grained lock -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org