[jira] [Resolved] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takanobu Asanuma resolved HDFS-17432.
-------------------------------------
    Fix Version/s: 3.4.1, 3.5.0
       Resolution: Fixed

> Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17432
>                 URL: https://issues.apache.org/jira/browse/HDFS-17432
>             Project: Hadoop HDFS
>          Issue Type: Test
>            Reporter: Takanobu Asanuma
>            Assignee: Takanobu Asanuma
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.1, 3.5.0
>
> After HDFS-17370, JUnit4 tests stopped running in hadoop-hdfs-rbf. To enable
> both JUnit4 and JUnit5 tests to run, we need to add junit-vintage-engine to
> the hadoop-hdfs-rbf/pom.xml.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
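For context, the kind of dependency addition the issue describes looks roughly like the fragment below. This is an illustrative sketch, not necessarily the exact declaration committed for HDFS-17432 (in the Hadoop build the version is typically inherited from dependency management in a parent pom):

```xml
<!-- Illustrative pom.xml fragment: junit-vintage-engine lets the JUnit 5
     platform discover and run JUnit 4 tests alongside JUnit 5 ones.
     Version omitted on the assumption it is managed by a parent pom. -->
<dependency>
  <groupId>org.junit.vintage</groupId>
  <artifactId>junit-vintage-engine</artifactId>
  <scope>test</scope>
</dependency>
```

Without a vintage engine on the test classpath, surefire running on the JUnit 5 platform silently discovers only JUnit 5 tests, which matches the symptom described above.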
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829384#comment-17829384 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma commented on PR #6639:
URL: https://github.com/apache/hadoop/pull/6639#issuecomment-2011271575

   Thanks again for your review, @dineshchitlangia.
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829383#comment-17829383 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma merged PR #6639:
URL: https://github.com/apache/hadoop/pull/6639
[jira] [Commented] (HDFS-17432) Fix junit dependency to enable JUnit4 tests to run in hadoop-hdfs-rbf
[ https://issues.apache.org/jira/browse/HDFS-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829381#comment-17829381 ]

ASF GitHub Bot commented on HDFS-17432:
---------------------------------------

tasanuma commented on PR #6639:
URL: https://github.com/apache/hadoop/pull/6639#issuecomment-2011270228

   The failed tests are caused by HDFS-17354; I created HDFS-17435 to address that issue. Since this PR doesn't cause those failures, I'm merging it.
[jira] [Created] (HDFS-17435) Fix TestRouterRpc#testClearStaleNamespacesInRouterStateIdContext() failed
Takanobu Asanuma created HDFS-17435:
---------------------------------------

             Summary: Fix TestRouterRpc#testClearStaleNamespacesInRouterStateIdContext() failed
                 Key: HDFS-17435
                 URL: https://issues.apache.org/jira/browse/HDFS-17435
             Project: Hadoop HDFS
          Issue Type: Test
            Reporter: Takanobu Asanuma

TestRouterRpc and TestRouterRpcMultiDestination are failing with the following error.

{noformat}
[ERROR] testProxyGetBlockKeys  Time elapsed: 0.573 s <<< ERROR!
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: jenkins is not allowed to impersonate jenkins
{noformat}

This is caused by testClearStaleNamespacesInRouterStateIdContext(), which was introduced by HDFS-17354.
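The AuthorizationException above is Hadoop's standard proxy-user check rejecting the impersonation attempt. As background only (not necessarily the fix adopted for HDFS-17435), impersonation is normally permitted for a user through core-site.xml settings like the following; the jenkins user name and wildcard values here are illustrative placeholders:

```xml
<!-- Illustrative core-site.xml fragment; user name and values are placeholders. -->
<property>
  <!-- hosts from which the proxy user may submit impersonated requests -->
  <name>hadoop.proxyuser.jenkins.hosts</name>
  <value>*</value>
</property>
<property>
  <!-- groups whose members the proxy user may impersonate -->
  <name>hadoop.proxyuser.jenkins.groups</name>
  <value>*</value>
</property>
```

In the test context, such settings would typically be applied to the MiniDFSCluster configuration rather than a real core-site.xml.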
[jira] [Resolved] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dinesh Chitlangia resolved HDFS-17433.
--------------------------------------
    Fix Version/s: 3.5.0
       Resolution: Fixed

> metrics sumOfActorCommandQueueLength should only record valid commands
> ----------------------------------------------------------------------
>
>                 Key: HDFS-17433
>                 URL: https://issues.apache.org/jira/browse/HDFS-17433
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.4.0
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 3.5.0
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829368#comment-17829368 ]

ASF GitHub Bot commented on HDFS-17433:
---------------------------------------

dineshchitlangia merged PR #6644:
URL: https://github.com/apache/hadoop/pull/6644
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829367#comment-17829367 ]

ASF GitHub Bot commented on HDFS-17433:
---------------------------------------

dineshchitlangia commented on PR #6644:
URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2011163090

   Thanks @hfutatzhanghb for the contribution and @shardulsadavarte for the review.
[jira] [Commented] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829366#comment-17829366 ]

farmmamba commented on HDFS-17434:
----------------------------------

[~qinyuren] Hi, could you please show your createRbw avgTime?

> Selector.select in SocketIOWithTimeout.java has significant overhead
> --------------------------------------------------------------------
>
>                 Key: HDFS-17434
>                 URL: https://issues.apache.org/jira/browse/HDFS-17434
>             Project: Hadoop HDFS
>          Issue Type: Test
>            Reporter: qinyuren
>            Priority: Major
>         Attachments: image-2024-03-20-19-10-13-016.png, image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, image-2024-03-20-19-55-18-378.png
>
> In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges
> from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine
> network card bandwidth is 2Mb/s.
> !image-2024-03-20-19-10-13-016.png|width=662,height=135!
> !image-2024-03-20-19-55-18-378.png!
> By adding log printing, it turns out that the Selector.select function has
> significant overhead.
> !image-2024-03-20-19-22-29-829.png|width=474,height=262!
> !image-2024-03-20-19-24-02-233.png|width=445,height=181!
> I would like to know if this falls within the normal range or how we can
> improve it.
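One way to sanity-check raw `Selector.select()` latency outside of HDFS is a tiny standalone probe like the sketch below (illustrative code, not from the Hadoop tree). With no channels registered, `select(timeout)` should block close to the timeout and return promptly once it elapses; wake-up latencies far beyond the timeout would point at JVM- or OS-level overhead rather than SocketIOWithTimeout itself:

```java
import java.nio.channels.Selector;

public class SelectProbe {
    public static void main(String[] args) throws Exception {
        try (Selector selector = Selector.open()) {
            selector.selectNow(); // warm-up: class loading and selector setup
            for (int i = 0; i < 3; i++) {
                long start = System.nanoTime();
                // No channels registered, so this blocks until the timeout.
                int ready = selector.select(100);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("ready=" + ready + " elapsedMs=" + elapsedMs);
            }
        }
    }
}
```

On a healthy machine the printed elapsedMs values should sit close to 100; much larger values on timed-out selects would be worth investigating at the OS level.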
[jira] [Updated] (HDFS-17416) [FGL] Monitor threads in BlockManager.class support fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HDFS-17416:
----------------------------------
    Labels: pull-request-available  (was: )

> [FGL] Monitor threads in BlockManager.class support fine-grained lock
> ---------------------------------------------------------------------
>
>                 Key: HDFS-17416
>                 URL: https://issues.apache.org/jira/browse/HDFS-17416
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>
> There are some monitor threads in BlockManager.class.
>
> This ticket is used to make these threads support fine-grained locking:
> * BlockReportProcessingThread
> * MarkedDeleteBlockScrubber
> * RedundancyMonitor
> * Reconstruction Queue Initializer
[jira] [Commented] (HDFS-17416) [FGL] Monitor threads in BlockManager.class support fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829365#comment-17829365 ]

ASF GitHub Bot commented on HDFS-17416:
---------------------------------------

ZanderXu opened a new pull request, #6647:
URL: https://github.com/apache/hadoop/pull/6647

   Threads in BlockManager.class support fine-grained lock.
   - BlockReportProcessingThread
   - MarkedDeleteBlockScrubber
   - RedundancyMonitor
   - Reconstruction Queue Initializer

   Normally, these threads just need BMReadLock or BMWriteLock, but there are some cases that still need FSReadLock and FSWriteLock:
   - UpdateQuota while completing one block
   - GetStoragePolicyId while removing excess replicas
   - GetFullPath while checking if it is a snapshot
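The split the PR describes can be pictured with two independent read-write locks. The sketch below is a hypothetical toy model; names like bmLock/fsLock and the methods are illustrative, not the actual Hadoop FGL API:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FineGrainedLocks {
    // Two separate locks: most BlockManager monitor work only needs bmLock,
    // so it no longer contends with namespace operations guarded by fsLock.
    private final ReentrantReadWriteLock bmLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();

    int scanRedundancy(int queued) {
        bmLock.readLock().lock(); // block-level read state only
        try {
            return queued; // placeholder for scanning low-redundancy queues
        } finally {
            bmLock.readLock().unlock();
        }
    }

    void completeBlockAndUpdateQuota() {
        // The exceptional cases listed in the PR (e.g. quota updates) still
        // need the namespace lock; acquiring fsLock before bmLock keeps a
        // consistent lock order and avoids deadlock.
        fsLock.writeLock().lock();
        try {
            bmLock.writeLock().lock();
            try {
                // placeholder: complete the block, then update namespace quota
            } finally {
                bmLock.writeLock().unlock();
            }
        } finally {
            fsLock.writeLock().unlock();
        }
    }
}
```

The benefit under this assumption is that the four monitor threads serialize only against each other and block-map mutations, not against every namespace RPC.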
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829364#comment-17829364 ]

ASF GitHub Bot commented on HDFS-17430:
---------------------------------------

haiyang1987 commented on code in PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#discussion_r1533200066

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
##########

@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
         LOG.info("Skipped stale nodes for recovery : "
             + (storages.length - recoveryLocations.size()));
       }
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
     } else {
-      // If too many replicas are stale, then choose all replicas to
+      // If too many replicas are stale, then choose live replicas to
       // participate in block recovery.
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+      recoveryLocations.clear();
+      storageIdx.clear();
+      for (int i = 0; i < storages.length; ++i) {
+        if (storages[i].getDatanodeDescriptor().isAlive()) {
+          recoveryLocations.add(storages[i]);
+          storageIdx.add(i);
+        }
+      }
+      assert recoveryLocations.size() > 0 : "recoveryLocations size should be > 0";

Review Comment:
   Checked the code again: when handleHeartbeat executes getBlockRecoveryCommand, the reporting datanode should be in the live state at that time, so the size of recoveryLocations is at least 1. So maybe we can remove this assert logic.

> RecoveringBlock will skip no live replicas when get block recovery command.
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-17430
>                 URL: https://issues.apache.org/jira/browse/HDFS-17430
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> RecoveringBlock should skip non-live replicas when building the block recovery command.
>
> *Issue:*
> Currently the following scenario may lead to failure in the execution of
> BlockRecoveryWorker by the datanode, resulting in the file not being closed
> for a long time.
> *t1.* The block blk_xxx_xxx has two replicas [dn1, dn2]; the dn1 machine shut down
> and is in dead status, while dn2 is live.
> *t2.* Block recovery occurs.
> related logs:
> {code:java}
> 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx
> {code}
> *t3.* dn2 is chosen for block recovery.
> dn1 is marked as stale (it is in the dead state) at this time, so the
> recoveryLocations size is 1. Under the following logic, both dn1 and dn2
> will then be chosen to participate in block recovery.
> DatanodeManager#getBlockRecoveryCommand:
> {code:java}
> // Skip stale nodes during recovery
> final List<DatanodeStorageInfo> recoveryLocations =
>     new ArrayList<>(storages.length);
> final List<Integer> storageIdx = new ArrayList<>(storages.length);
> for (int i = 0; i < storages.length; ++i) {
>   if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
>     recoveryLocations.add(storages[i]);
>     storageIdx.add(i);
>   }
> }
> ...
> // If we only get 1 replica after eliminating stale nodes, choose all
> // replicas for recovery and let the primary data node handle failures.
> DatanodeInfo[] recoveryInfos;
> if (recoveryLocations.size() > 1) {
>   if (recoveryLocations.size() != storages.length) {
>     LOG.info("Skipped stale nodes for recovery : "
>         + (storages.length - recoveryLocations.size()));
>   }
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
> } else {
>   // If too many replicas are stale, then choose all replicas to
>   // participate in block recovery.
>   recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
> }
> {code}
> {code:java}
> 2024-03-13 21:58:01,425 INFO datanode.DataNode (BlockRecoveryWorker.java:logRecoverBlock(563)) [org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1@54e291ac] - BlockRecoveryWorker: NameNode at xxx:8040 calls recoverBlock(BP-xxx:blk_xxx_xxx, targets=[DatanodeInfoWithStorage[dn1:50010,null,null], DatanodeInfoWithStorage[dn2:50010,null,null]], newGenerationStamp=28577373754, newBlock=null, isStriped=false)
> {code}
> *t4.* When dn2 executes BlockRecoveryWorker#recover, it will call the
> initReplicaRecovery operation on dn1; however, since the dn1 machine is
> down at this time, it will take a very long time to time out (the default
> number of retries to establish a server connection is 45).
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829363#comment-17829363 ]

ASF GitHub Bot commented on HDFS-17430:
---------------------------------------

haiyang1987 commented on code in PR #6635:
URL: https://github.com/apache/hadoop/pull/6635#discussion_r1533195923

##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
##########

@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
         LOG.info("Skipped stale nodes for recovery : "
             + (storages.length - recoveryLocations.size()));
       }
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
     } else {
-      // If too many replicas are stale, then choose all replicas to
+      // If too many replicas are stale, then choose live replicas to
       // participate in block recovery.
-      recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+      recoveryLocations.clear();
+      storageIdx.clear();
+      for (int i = 0; i < storages.length; ++i) {
+        if (storages[i].getDatanodeDescriptor().isAlive()) {

Review Comment:
   Thanks @Hexiaoqiao for your comment. Is your suggestion that only replicas which are both non-stale and live should be chosen to participate in block recovery?
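The selection rule being discussed in this review can be sketched in isolation. This is a toy model that assumes a simple Replica record with stale/alive flags rather than the real DatanodeStorageInfo API: prefer non-stale replicas, and only when at most one remains fall back to the *live* replicas (the proposed fix) instead of all replicas including dead ones (the old behavior):

```java
import java.util.ArrayList;
import java.util.List;

public class RecoveryTargets {
    // Hypothetical stand-in for DatanodeStorageInfo + DatanodeDescriptor state.
    record Replica(String node, boolean stale, boolean alive) {}

    static List<Replica> choose(List<Replica> storages) {
        // Pass 1: skip stale nodes, mirroring the existing loop in
        // DatanodeManager#getBlockRecoveryCommand.
        List<Replica> recovery = new ArrayList<>();
        for (Replica r : storages) {
            if (!r.stale) {
                recovery.add(r);
            }
        }
        // Fallback: too many stale replicas. Choose live replicas only,
        // rather than every replica (which could include dead datanodes
        // that make the primary block a long connection-retry timeout).
        if (recovery.size() <= 1) {
            recovery.clear();
            for (Replica r : storages) {
                if (r.alive) {
                    recovery.add(r);
                }
            }
        }
        return recovery;
    }
}
```

In the scenario from the issue (dn1 dead and stale, dn2 live), this rule selects only dn2, so recovery no longer waits on a 45-retry connection timeout to the dead node.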
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829358#comment-17829358 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

haiyang1987 commented on PR #6628:
URL: https://github.com/apache/hadoop/pull/6628#issuecomment-2011103924

   Thanks @ZanderXu for your review and merge~

> Remove Invalid FileSystemECReadStats logic in DFSInputStream
> ------------------------------------------------------------
>
>                 Key: HDFS-17426
>                 URL: https://issues.apache.org/jira/browse/HDFS-17426
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>              Labels: pull-request-available
>
> When reading a replicated file, the following logic is called in the
> _readingWithStrategy_ and _actualGetFromOneDataNode_ logic in
> DFSInputStream.java:
> {code:java}
> if (readStatistics.getBlockType() == BlockType.STRIPED) {
>   dfsClient.updateFileSystemECReadStats(nread);
> }
> {code}
> This is an invalid call there; we can remove it.
[jira] [Resolved] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZanderXu resolved HDFS-17426.
-----------------------------
    Resolution: Fixed
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829353#comment-17829353 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

ZanderXu commented on PR #6628:
URL: https://github.com/apache/hadoop/pull/6628#issuecomment-2011080558

   Merged.
[jira] [Commented] (HDFS-17426) Remove Invalid FileSystemECReadStats logic in DFSInputStream
[ https://issues.apache.org/jira/browse/HDFS-17426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829352#comment-17829352 ]

ASF GitHub Bot commented on HDFS-17426:
---------------------------------------

ZanderXu merged PR #6628:
URL: https://github.com/apache/hadoop/pull/6628
[jira] [Commented] (HDFS-17103) messy file system cleanup in TestNameEditsConfigs
[ https://issues.apache.org/jira/browse/HDFS-17103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829323#comment-17829323 ]

ASF GitHub Bot commented on HDFS-17103:
---------------------------------------

teamconfx commented on PR #6071:
URL: https://github.com/apache/hadoop/pull/6071#issuecomment-2010748510

   Hi @ayushtkn are we able to merge this?

> messy file system cleanup in TestNameEditsConfigs
> -------------------------------------------------
>
>                 Key: HDFS-17103
>                 URL: https://issues.apache.org/jira/browse/HDFS-17103
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
> h2. What happened:
> Got a {{NullPointerException}} without message when running {{TestNameEditsConfigs}}.
> h2. Where's the bug:
> In line 450 of {{TestNameEditsConfigs}}, the test attempts to clean up the file system:
> {noformat}
> ...
> fileSys = cluster.getFileSystem();
> ...
> } finally {
>   fileSys.close();
>   cluster.shutdown();
> }{noformat}
> However, the cleanup would result in a {{NullPointerException}} that covers up the actual exception if the initialization of {{fileSys}} fails or another exception is thrown before the line that initializes {{fileSys}}.
> h2. How to reproduce:
> (1) Set {{dfs.namenode.maintenance.replication.min}} to {{-1155969698}}
> (2) Run test: {{org.apache.hadoop.hdfs.server.namenode.TestNameEditsConfigs#testNameEditsConfigsFailure}}
> h2. Stacktrace:
> {noformat}
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.namenode.TestNameEditsConfigs.testNameEditsConfigsFailure(TestNameEditsConfigs.java:450){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
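A null-safe variant of the cleanup pattern flagged in this report can be sketched as follows. This is an illustrative sketch with a stand-in openFileSystem() rather than the MiniDFSCluster API; the point is that an unguarded close() in the finally block raises an NPE that masks the exception which actually broke the test:

```java
import java.io.Closeable;
import java.io.IOException;

public class SafeCleanup {
    // Hypothetical stand-in for cluster.getFileSystem(); may throw before
    // the local variable is ever assigned.
    static Closeable openFileSystem(boolean fail) throws IOException {
        if (fail) {
            throw new IOException("mini-cluster failed to start");
        }
        return () -> { };
    }

    static void runTest(boolean fail) throws IOException {
        Closeable fileSys = null;
        try {
            fileSys = openFileSystem(fail);
            // ... test body would go here ...
        } finally {
            // Guard the cleanup: if initialization threw, fileSys is still
            // null, and an unguarded fileSys.close() would throw an NPE
            // that hides the original failure.
            if (fileSys != null) {
                fileSys.close();
            }
        }
    }
}
```

With the guard in place, the original IOException propagates intact instead of being replaced by a NullPointerException from the finally block.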
[jira] [Commented] (HDFS-17109) Null Pointer Exception when running TestBlockManager
[ https://issues.apache.org/jira/browse/HDFS-17109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829321#comment-17829321 ]

ASF GitHub Bot commented on HDFS-17109:
---------------------------------------

teamconfx commented on PR #6046:
URL: https://github.com/apache/hadoop/pull/6046#issuecomment-2010741429

   Hi @goiri, is there anything else I can do to get this PR merged?

> Null Pointer Exception when running TestBlockManager
> ----------------------------------------------------
>
>                 Key: HDFS-17109
>                 URL: https://issues.apache.org/jira/browse/HDFS-17109
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: reproduce.sh
>
> h2. What happened
> After setting {{dfs.namenode.redundancy.considerLoadByStorageType=true}}, running the test {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} results in a {{NullPointerException}}.
> h2. Where's the bug
> In the class {{BlockPlacementPolicyDefault}}:
> {noformat}
> for (StorageType s : storageTypes) {
>   StorageTypeStats storageTypeStats = storageStats.get(s);
>   numNodes += storageTypeStats.getNodesInService();
>   numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
> }{noformat}
> However, the class does not check whether storageTypeStats is null, causing the NPE.
> h2. How to reproduce
> # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}}
> # Run {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} and the following exception should be observed:
> {noformat}
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
>     at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51)
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364)
>     at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.
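The guard this report implies can be sketched with a toy stats map. The names below are hypothetical stand-ins, not the actual BlockPlacementPolicyDefault code: the idea is to skip storage types that have no stats entry instead of dereferencing a missing one:

```java
import java.util.Map;

public class XceiverAverage {
    enum StorageType { DISK, SSD, ARCHIVE }

    // Hypothetical stand-in for StorageTypeStats.
    record TypeStats(int nodesInService, int xceiverCount) {}

    // Null-guarded aggregation: a storage type that no datanode reports yet
    // has no map entry, and is skipped instead of causing an NPE.
    static double averageXceivers(Map<StorageType, TypeStats> stats,
                                  Iterable<StorageType> requested) {
        int nodes = 0;
        int xceivers = 0;
        for (StorageType t : requested) {
            TypeStats s = stats.get(t);
            if (s == null) {
                continue; // no stats collected for this storage type
            }
            nodes += s.nodesInService();
            xceivers += s.xceiverCount();
        }
        return nodes == 0 ? 0.0 : (double) xceivers / nodes;
    }
}
```

Whether the real fix should skip the type or treat a missing entry as zero counts is a design choice for the PR; both avoid the crash, and this sketch shows the skip variant.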
[jira] [Commented] (HDFS-17099) Null Pointer Exception when stop namesystem in HDFS
[ https://issues.apache.org/jira/browse/HDFS-17099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829319#comment-17829319 ] ASF GitHub Bot commented on HDFS-17099: --- teamconfx commented on PR #6034: URL: https://github.com/apache/hadoop/pull/6034#issuecomment-2010737838 Hi @ayushtkn @Hexiaoqiao, are we able to merge this PR if it looks good to you? ;) > Null Pointer Exception when stop namesystem in HDFS > --- > > Key: HDFS-17099 > URL: https://issues.apache.org/jira/browse/HDFS-17099 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ConfX >Assignee: ConfX >Priority: Critical > Labels: pull-request-available > Attachments: reproduce.sh > > > h2. What happened: > Got a NullPointerException when stopping the namesystem in HDFS. > h2. Buggy code: > > {code:java} > void stopActiveServices() { > ... > if (dir != null && getFSImage() != null) { > if (getFSImage().editLog != null) { // <--- Check whether editLog is > null > getFSImage().editLog.close(); > } > // Update the fsimage with the last txid that we wrote > // so that the tailer starts from the right spot. > getFSImage().updateLastAppliedTxIdFromWritten(); // <--- BUG: Even if > editLog is null, this line will still be executed and cause a > NullPointerException > } > ... > } public void updateLastAppliedTxIdFromWritten() { > this.lastAppliedTxId = editLog.getLastWrittenTxId(); // <--- This will > cause a NullPointerException if editLog is null > } {code} > h2. 
StackTrace: > > {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSImage.updateLastAppliedTxIdFromWritten(FSImage.java:1553) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.stopActiveServices(FSNamesystem.java:1463) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.close(FSNamesystem.java:1815) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:1017) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:248) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:194) > at > org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.(SecondaryNameNode.java:181) > {code} > h2. How to reproduce: > (1) Set {{dfs.namenode.top.windows.minutes}} to {{{}37914516,32,0{}}}; or set > {{dfs.namenode.top.window.num.buckets}} to {{{}244111242{}}}. > (2) Run test: > {{org.apache.hadoop.hdfs.server.namenode.TestNameNodeHttpServerXFrame#testSecondaryNameNodeXFrame}} > h2. What's more: > I'm still investigating how the parameter > {{dfs.namenode.top.windows.minutes}} triggered the buggy code. > > For an easy reproduction, run the reproduce.sh in the attachment. > We are happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
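A minimal fix consistent with the buggy code quoted above is to make updateLastAppliedTxIdFromWritten itself tolerate a null editLog. The sketch below models the FSImage/FSNamesystem interaction with stand-in classes; the names mirror the Hadoop code, but this is an illustrative sketch under that assumption, not the actual patch.

```java
// Minimal model of the quoted FSImage/FSNamesystem interaction.
class EditLog {
    long getLastWrittenTxId() { return 42; }
    void close() { /* no-op for the sketch */ }
}

class FSImage {
    EditLog editLog;      // may legitimately be null (e.g. SecondaryNameNode path)
    long lastAppliedTxId; // defaults to 0

    void updateLastAppliedTxIdFromWritten() {
        if (editLog != null) { // guard that the quoted code is missing
            lastAppliedTxId = editLog.getLastWrittenTxId();
        }
    }
}

public class StopActiveServicesSketch {
    static void stopActiveServices(FSImage image) {
        if (image != null) {
            if (image.editLog != null) {
                image.editLog.close();
            }
            // Safe now even when editLog is null.
            image.updateLastAppliedTxIdFromWritten();
        }
    }

    public static void main(String[] args) {
        FSImage noLog = new FSImage();   // editLog == null: previously threw NPE
        stopActiveServices(noLog);
        FSImage withLog = new FSImage();
        withLog.editLog = new EditLog();
        stopActiveServices(withLog);
        System.out.println(noLog.lastAppliedTxId + " " + withLog.lastAppliedTxId); // 0 42
    }
}
```

An equivalent alternative is to move the updateLastAppliedTxIdFromWritten() call inside the existing editLog null check in stopActiveServices; which variant the maintainers prefer is up to review.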
[jira] [Commented] (HDFS-17098) DatanodeManager does not handle null storage type properly
[ https://issues.apache.org/jira/browse/HDFS-17098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829316#comment-17829316 ] ASF GitHub Bot commented on HDFS-17098: --- teamconfx commented on code in PR #6035: URL: https://github.com/apache/hadoop/pull/6035#discussion_r1532946326 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java: ## @@ -666,7 +666,15 @@ private Consumer> createSecondaryNodeSorter() { Consumer> secondarySort = null; if (readConsiderStorageType) { Comparator comp = - Comparator.comparing(DatanodeInfoWithStorage::getStorageType); + Comparator.comparing(DatanodeInfoWithStorage::getStorageType, (s1, s2) -> { + if (s1 == null) { Review Comment: @Hexiaoqiao we got this when we set the following configuration "dfs.heartbeat.interval=1753310367" and "dfs.namenode.read.considerStorageType=true". Under this config, the test "org.apache.hadoop.hdfs.server.namenode.TestNNStorageRetentionManager#testNoLogs" would trigger the case. > DatanodeManager does not handle null storage type properly > -- > > Key: HDFS-17098 > URL: https://issues.apache.org/jira/browse/HDFS-17098 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ConfX >Priority: Critical > Labels: pull-request-available > Attachments: reproduce.sh > > > h2. What happened: > Got a {{NullPointerException}} without message when sorting datanodes in > {{{}NetworkTopology{}}}. > h2. Where's the bug: > In line 654 of {{{}DatanodeManager{}}}, the manager creates a second sorter > using the standard {{Comparator}} class: > {noformat} > Comparator comp = > Comparator.comparing(DatanodeInfoWithStorage::getStorageType); > secondarySort = list -> Collections.sort(list, comp);{noformat} > This comparator is then used in {{NetworkTopology}} as a secondary sort to > break ties: > {noformat} > if (secondarySort != null) { > // a secondary sort breaks the tie between nodes. 
> secondarySort.accept(nodesList); > }{noformat} > However, if the storage type is {{{}null{}}}, a {{NullPointerException}} > would be thrown since the default {{Comparator.comparing}} cannot handle > comparison between null values. > h2. How to reproduce: > (1) Set {{dfs.heartbeat.interval}} to {{{}1753310367{}}}, and > {{dfs.namenode.read.considerStorageType}} to {{true}} > (2) Run test: > {{org.apache.hadoop.hdfs.server.blockmanagement.TestSortLocatedBlock#testAviodStaleAndSlowDatanodes}} > h2. Stacktrace: > {noformat} > java.lang.NullPointerException > at > java.base/java.util.Comparator.lambda$comparing$77a9974f$1(Comparator.java:469) > at java.base/java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.base/java.util.TimSort.sort(TimSort.java:220) > at java.base/java.util.Arrays.sort(Arrays.java:1515) > at java.base/java.util.ArrayList.sort(ArrayList.java:1750) > at java.base/java.util.Collections.sort(Collections.java:179) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.lambda$createSecondaryNodeSorter$0(DatanodeManager.java:654) > at > org.apache.hadoop.net.NetworkTopology.sortByDistance(NetworkTopology.java:983) > at > org.apache.hadoop.net.NetworkTopology.sortByDistanceUsingNetworkLocation(NetworkTopology.java:946) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlock(DatanodeManager.java:637) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.sortLocatedBlocks(DatanodeManager.java:554) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestSortLocatedBlock.testAviodStaleAndSlowDatanodes(TestSortLocatedBlock.java:144){noformat} > For an easy reproduction, run the reproduce.sh in the attachment. We are > happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
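The JDK already ships a null-tolerant combinator for exactly this situation: Comparator.nullsFirst (or nullsLast) can be passed as the key comparator to Comparator.comparing, so the tie-break sort no longer throws when a storage type is null. The sketch below is self-contained; the Node record is a hypothetical stand-in for DatanodeInfoWithStorage, not the actual Hadoop type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NullSafeSecondarySort {
    // Stand-in for DatanodeInfoWithStorage: only the sort key matters here.
    record Node(String name, String storageType) { }

    static void secondarySort(List<Node> nodes) {
        // nullsFirst wraps the key comparator so a null storage type sorts
        // ahead of non-null ones instead of throwing NullPointerException.
        Comparator<Node> comp = Comparator.comparing(
            Node::storageType,
            Comparator.nullsFirst(Comparator.naturalOrder()));
        nodes.sort(comp);
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("dn1", "SSD"));
        nodes.add(new Node("dn2", null));  // plain Comparator.comparing would NPE here
        nodes.add(new Node("dn3", "DISK"));
        secondarySort(nodes);
        nodes.forEach(n -> System.out.println(n.name())); // dn2, dn3, dn1
    }
}
```

Whether null should sort first or last (i.e. whether an unknown storage type should be preferred or deprioritized when breaking ties) is a policy choice for the patch under review.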
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17829249#comment-17829249 ] ASF GitHub Bot commented on HDFS-17433: --- hadoop-yetus commented on PR #6644: URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2010083682 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 31s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 7s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 15s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 39s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 15s | | trunk passed | | +1 :green_heart: | shadedclient | 34m 57s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 13s | | the patch passed | | +1 :green_heart: | compile | 1m 7s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 7s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 58s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 13s | | the patch passed | | +1 :green_heart: | javadoc | 0m 53s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 16s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 44s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 227m 51s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. 
| | | | 366m 24s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6644 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 9e01e979dce9 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 2b7a5f58664d3c4467817b5f8b150e66ab71a6ba | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/testReport/ | | Max. process+thread count | 4141 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetu
[jira] [Commented] (HDFS-17423) [FGL] BlockManagerSafeMode supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828953#comment-17828953 ] ASF GitHub Bot commented on HDFS-17423: --- hadoop-yetus commented on PR #6645: URL: https://github.com/apache/hadoop/pull/6645#issuecomment-2009645872 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 30s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ HDFS-17384 Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 56s | | HDFS-17384 passed | | +1 :green_heart: | compile | 1m 21s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 14s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 12s | | HDFS-17384 passed | | +1 :green_heart: | mvnsite | 1m 24s | | HDFS-17384 passed | | +1 :green_heart: | javadoc | 1m 7s | | HDFS-17384 passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 41s | | HDFS-17384 passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 14s | | HDFS-17384 passed | | +1 :green_heart: | shadedclient | 35m 26s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 9s | | the patch passed | | +1 :green_heart: | compile | 1m 13s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 13s | | the patch passed | | +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 8s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 59s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 12s | | the patch passed | | +1 :green_heart: | javadoc | 0m 54s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 15s | | the patch passed | | +1 :green_heart: | shadedclient | 35m 10s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 230m 46s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6645/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. 
| | | | 370m 34s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockManager | | | hadoop.hdfs.server.datanode.TestLargeBlockReport | | | hadoop.hdfs.server.blockmanagement.TestBlockManagerSafeMode | | | hadoop.hdfs.server.blockmanagement.TestReplicationPolicy | | | hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand | | | hadoop.hdfs.protocol.TestBlockListAsLongs | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6645/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6645 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux e66939c4dace 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | HDFS-17384 / 274f1b4e0a04ffc64e02ad3438869e8ebe761026 | | Default Java | Private Bui
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Description: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 2Mb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! !image-2024-03-20-19-55-18-378.png! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. was: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 10Gb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, > image-2024-03-20-19-55-18-378.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 2Mb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > !image-2024-03-20-19-55-18-378.png! 
> By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Issue Type: Wish (was: Task) > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Wish >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828925#comment-17828925 ] qinyuren commented on HDFS-17434: - [~hexiaoqiao] [~tasanuma] [~zanderxu] Please take a look. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Task >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17129) mis-order of ibr and fbr on datanode
[ https://issues.apache.org/jira/browse/HDFS-17129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated HDFS-17129: Attachment: image-2024-03-20-18-07-42-155.png > mis-order of ibr and fbr on datanode > - > > Key: HDFS-17129 > URL: https://issues.apache.org/jira/browse/HDFS-17129 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.4.0, 3.3.9, 3.3.6 > Environment: hdfs3.4.0 >Reporter: liuguanghua >Assignee: liuguanghua >Priority: Blocker > Labels: pull-request-available > Attachments: image-2024-03-20-18-07-42-155.png > > > HDFS-16016 provides a new thread to handle IBRs. That is a great improvement. > But it may cause a mis-order of IBRs and FBRs -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Attachment: image-2024-03-20-19-55-18-378.png > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png, > image-2024-03-20-19-55-18-378.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Issue Type: Test (was: Wish) > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Test >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! > !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828907#comment-17828907 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531806620 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { final int targetLength = strategy.getTargetLength(); -int length = 0; -try { - while (length < targetLength) { -int ret = strategy.readFromBlock(blockReader); -if (ret < 0) { - throw new IOException("Unexpected EOS from the reader"); +int curAttempts = 0; +while (curAttempts < readDNMaxAttempts) { + curAttempts++; + int length = 0; + try { +while (length < targetLength) { + int ret = strategy.readFromBlock(blockReader); + if (ret < 0) { +throw new IOException("Unexpected EOS from the reader"); + } + length += ret; +} +return length; + } catch (ChecksumException ce) { +DFSClient.LOG.warn("Found Checksum error for " ++ currentBlock + " from " + currentNode ++ " at " + ce.getPos()); +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +// we want to remember which block replicas we have tried +corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode); +throw ce; + } catch (IOException e) { +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +if (curAttempts < readDNMaxAttempts) { + if (readerInfos[chunkIndex].reader != null) { +readerInfos[chunkIndex].reader.close(); + } + if (dfsStripedInputStream.createBlockReader(currentBlock, + alignedStripe.getOffsetInBlock(), 
targetBlocks, Review Comment: Hi @Neilxzn @Hexiaoqiao @ayushtkn @zhangshuyan0 @ZanderXu what dou you think? Please also help to look into this issue when you have free time , thanks~ > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } 
> count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + "
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828905#comment-17828905 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531799282

## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:
@@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) {
   private int readToBuffer(BlockReader blockReader,
       DatanodeInfo currentNode, ByteBufferStrategy strategy,
-      ExtendedBlock currentBlock) throws IOException {
+      LocatedBlock currentBlock, int chunkIndex) throws IOException {
     final int targetLength = strategy.getTargetLength();
-    int length = 0;
-    try {
-      while (length < targetLength) {
-        int ret = strategy.readFromBlock(blockReader);
-        if (ret < 0) {
-          throw new IOException("Unexpected EOS from the reader");
+    int curAttempts = 0;
+    while (curAttempts < readDNMaxAttempts) {
+      curAttempts++;
+      int length = 0;
+      try {
+        while (length < targetLength) {
+          int ret = strategy.readFromBlock(blockReader);
+          if (ret < 0) {
+            throw new IOException("Unexpected EOS from the reader");
+          }
+          length += ret;
+        }
+        return length;
+      } catch (ChecksumException ce) {
+        DFSClient.LOG.warn("Found Checksum error for "
+            + currentBlock + " from " + currentNode
+            + " at " + ce.getPos());
+        //Clear buffer to make next decode success
+        strategy.getReadBuffer().clear();
+        // we want to remember which block replicas we have tried
+        corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode);
+        throw ce;
+      } catch (IOException e) {
+        //Clear buffer to make next decode success
+        strategy.getReadBuffer().clear();
+        if (curAttempts < readDNMaxAttempts) {
+          if (readerInfos[chunkIndex].reader != null) {
+            readerInfos[chunkIndex].reader.close();
+          }
+          if (dfsStripedInputStream.createBlockReader(currentBlock,
+              alignedStripe.getOffsetInBlock(),
targetBlocks, Review Comment: ``` if (dfsStripedInputStream.createBlockReader(currentBlock, offsetInBlock, targetBlocks, ``` > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > 
System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > T
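The retry loop introduced by the `readToBuffer` diff above can be reduced to a small self-contained sketch. Names such as `Reader` and `readToLength` are illustrative stand-ins, not the Hadoop StripeReader API; it shows only the bounded-attempt structure the patch adds:

```java
import java.io.IOException;

/**
 * Minimal sketch of the bounded-retry read loop discussed in the review:
 * on an IOException the (hypothetical) reader would be recreated and the
 * read retried, up to maxAttempts. Illustrative only, not Hadoop code.
 */
public class RetryReadSketch {
  interface Reader {
    // returns number of bytes read; < 0 signals end-of-stream
    int read() throws IOException;
  }

  static int readToLength(Reader reader, int targetLength, int maxAttempts)
      throws IOException {
    int attempts = 0;
    while (attempts < maxAttempts) {
      attempts++;
      int length = 0;
      try {
        while (length < targetLength) {
          int ret = reader.read();
          if (ret < 0) {
            throw new IOException("Unexpected EOS from the reader");
          }
          length += ret;
        }
        return length;
      } catch (IOException e) {
        if (attempts >= maxAttempts) {
          throw e; // retries exhausted: surface the failure
        }
        // a real client would close and recreate the block reader here,
        // and (per the review comments) resume from the offset already
        // consumed instead of re-reading from the start of the range
      }
    }
    throw new IOException("retries exhausted");
  }
}
```

The reviewers' point is exactly about the resume position: after a transient failure, `createBlockReader` must be positioned at the actual `offsetInBlock`, otherwise the retried read returns duplicate data.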
[jira] [Updated] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
[ https://issues.apache.org/jira/browse/HDFS-17434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qinyuren updated HDFS-17434: Description: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine network card bandwidth is 10Gb/s. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. was: In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. > Selector.select in SocketIOWithTimeout.java has significant overhead > > > Key: HDFS-17434 > URL: https://issues.apache.org/jira/browse/HDFS-17434 > Project: Hadoop HDFS > Issue Type: Task >Reporter: qinyuren >Priority: Major > Attachments: image-2024-03-20-19-10-13-016.png, > image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png > > > In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges > from 5ms to 10ms, exceeding the usual disk reading overhead. Our machine > network card bandwidth is 10Gb/s. > !image-2024-03-20-19-10-13-016.png|width=662,height=135! > By adding log printing, it turns out that the Selector.select function has > significant overhead. > !image-2024-03-20-19-22-29-829.png|width=474,height=262! 
> !image-2024-03-20-19-24-02-233.png|width=445,height=181! > I would like to know if this falls within the normal range or how we can > improve it. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
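The measurement described in HDFS-17434 (adding log printing around `Selector.select`) can be sketched with plain `java.nio`; this is an illustrative instrument, not DataNode code, and the helper name `timedSelect` is an assumption:

```java
import java.io.IOException;
import java.nio.channels.Selector;

/**
 * Sketch of the instrumentation described in HDFS-17434: time how long a
 * single Selector.select(timeout) call actually blocks, so its overhead can
 * be compared against the expected network wait.
 */
public class SelectTimingSketch {
  // returns the observed blocking time of one select call, in nanoseconds
  static long timedSelect(Selector selector, long timeoutMs) throws IOException {
    long start = System.nanoTime();
    selector.select(timeoutMs); // may return early when channels are ready
    return System.nanoTime() - start;
  }
}
```

Logging this duration around the select call in SocketIOWithTimeout is essentially what the screenshots attached to the issue show.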
[jira] [Commented] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828914#comment-17828914 ] Xiping Zhang commented on HDFS-16016: - HDFS-16016 is a good improvement. In our production environment we have some large DNs with 24 disks and total blocks reaching more than 10 million. As hardware develops, DNs may become even larger, and if the FBR and IBR are coupled together, the impact on the service is significant. HDFS-16016 solves exactly this DN scaling problem. For issue HDFS-17129, I have a solution, which is to redefine the semantics of the FBR. Instead of requiring the DN to align all of its blocks 100% with the NameNode in this FBR, we only need to compare the blocks before the last block of the FBR, even though the FBR missed some blocks from the incremental report. I've drawn a diagram for ease of understanding:
* step1: the NN block-report processing flow before HDFS-16016
* step2: the NN block-report processing flow after HDFS-16016, where the HDFS-17129 problem can occur
* step3: we operate only on the blocks before the last zero bound point of the FBR
* step4: blocks not handled by the previous FBR are processed by the next FBR, unless the DN does not add any new blocks between FBRs
!image-2024-03-20-18-31-23-937.png! [~liuguanghua] [~hexiaoqiao] [~tasanuma] hello, do you have any suggestions on this understanding of FBR and this plan? Using lock restrictions here would be like going back to square one. If we use this solution, we only need to remove the remaining to_remove blocks, which requires removing only one piece of code. 
> BPServiceActor add a new thread to handle IBR > - > > Key: HDFS-16016 > URL: https://issues.apache.org/jira/browse/HDFS-16016 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.6 > > Attachments: image-2023-11-03-18-11-54-502.png, > image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, > image-2024-03-20-18-31-23-937.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. > We can handle IBR independently to improve the performance of heartbeat and > FBR. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
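The "redefined FBR semantics" proposed above — reconciling only the blocks up to the last block the FBR actually covered, and deferring the rest to the next FBR — can be sketched as follows. This is purely illustrative: `blocksToRemove` and the raw long block ids are stand-ins, not the BlockManager API.

```java
import java.util.Set;
import java.util.TreeSet;

/**
 * Sketch of the "partial FBR" idea: during a full block report, only blocks
 * at or below the highest block id the FBR reported are reconciled; anything
 * beyond that boundary is left for IBRs / the next FBR.
 */
public class PartialFbrSketch {
  static Set<Long> blocksToRemove(Set<Long> namenodeView, Set<Long> fbrBlocks) {
    if (fbrBlocks.isEmpty()) {
      return Set.of();
    }
    long boundary = new TreeSet<>(fbrBlocks).last();
    Set<Long> toRemove = new TreeSet<>();
    for (long b : namenodeView) {
      // only blocks at or below the FBR boundary can safely be judged stale;
      // later blocks may simply have arrived via IBR after the FBR snapshot
      if (b <= boundary && !fbrBlocks.contains(b)) {
        toRemove.add(b);
      }
    }
    return toRemove;
  }
}
```

Blocks with ids above the FBR's last reported block are deferred rather than removed, matching step3/step4 in the diagram.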
[jira] [Commented] (HDFS-17430) RecoveringBlock will skip no live replicas when get block recovery command.
[ https://issues.apache.org/jira/browse/HDFS-17430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828912#comment-17828912 ] ASF GitHub Bot commented on HDFS-17430: --- Hexiaoqiao commented on code in PR #6635: URL: https://github.com/apache/hadoop/pull/6635#discussion_r1531824148

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
       LOG.info("Skipped stale nodes for recovery : "
           + (storages.length - recoveryLocations.size()));
     }
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
   } else {
-    // If too many replicas are stale, then choose all replicas to
+    // If too many replicas are stale, then choose live replicas to
     // participate in block recovery.
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+    recoveryLocations.clear();
+    storageIdx.clear();
+    for (int i = 0; i < storages.length; ++i) {
+      if (storages[i].getDatanodeDescriptor().isAlive()) {

Review Comment: What about adding this condition to L1736~L1740 together?

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java:
@@ -1755,12 +1755,24 @@ private BlockRecoveryCommand getBlockRecoveryCommand(String blockPoolId,
       LOG.info("Skipped stale nodes for recovery : "
           + (storages.length - recoveryLocations.size()));
     }
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations);
   } else {
-    // If too many replicas are stale, then choose all replicas to
+    // If too many replicas are stale, then choose live replicas to
     // participate in block recovery.
-    recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(storages);
+    recoveryLocations.clear();
+    storageIdx.clear();
+    for (int i = 0; i < storages.length; ++i) {
+      if (storages[i].getDatanodeDescriptor().isAlive()) {
+        recoveryLocations.add(storages[i]);
+        storageIdx.add(i);
+      }
+    }
+    assert recoveryLocations.size() > 0 : "recoveryLocations size should be > 0";

Review Comment: Is this assert necessary here, or could `recoveryLocations` have size 0 if all DataNodes are not alive?

> RecoveringBlock will skip no live replicas when get block recovery command. > --- > > Key: HDFS-17430 > URL: https://issues.apache.org/jira/browse/HDFS-17430 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > Labels: pull-request-available > > RecoveringBlock maybe skip no live replicas when get block recovery command. > *Issue:* > Currently the following scenarios may lead to failure in the execution of > BlockRecoveryWorker by the datanode, resulting file being not to be closed > for a long time. > *t1.* The block_xxx_xxx has two replicas[dn1,dn2]; the dn1 machine shut down > and will be dead status, the dn2 is live status. > *t2.* Occurs block recovery. > related logs: > {code:java} > 2024-03-13 21:58:00.651 WARN hdfs.StateChange DIR* > NameSystem.internalReleaseLease: File /xxx/file has not been closed. Lease > recovery is in progress. RecoveryId = 28577373754 for block blk_xxx_xxx > {code} > *t3.* The dn2 is chosen for block recovery. > dn1 is marked as stale (is dead state) at this time, here the > recoveryLocations size is 1, currently according to the following logic, dn1 > and dn2 will be chosen to participate in block recovery. 
> DatanodeManager#getBlockRecoveryCommand > {code:java} >// Skip stale nodes during recovery > final List<DatanodeStorageInfo> recoveryLocations = > new ArrayList<>(storages.length); > final List<Integer> storageIdx = new ArrayList<>(storages.length); > for (int i = 0; i < storages.length; ++i) { >if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) { > recoveryLocations.add(storages[i]); > storageIdx.add(i); >} > } > ... > // If we only get 1 replica after eliminating stale nodes, choose all > // replicas for recovery and let the primary data node handle failures. > DatanodeInfo[] recoveryInfos; > if (recoveryLocations.size() > 1) { >if (recoveryLocations.size() != storages.length) { > LOG.info("Skipped stale nodes for recovery : " > + (storages.length - recoveryLocations.size())); >} >recoveryInfos = DatanodeStorageInfo.toDatanodeInfos(recoveryLocations); > } else { >
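The replica selection discussed in HDFS-17430 — skip stale replicas first, and if at most one remains, fall back to live replicas only instead of all replicas — can be sketched independently of the Hadoop types. `Storage` here is a hypothetical stand-in for `DatanodeStorageInfo`, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of choosing block-recovery locations: prefer non-stale replicas,
 * but when too few remain, restrict the fallback to replicas on live
 * DataNodes so dead nodes (like dn1 in the issue) are never chosen.
 */
public class RecoveryLocationSketch {
  record Storage(String id, boolean stale, boolean alive) {}

  static List<Storage> chooseRecoveryLocations(List<Storage> storages) {
    List<Storage> recovery = new ArrayList<>();
    for (Storage s : storages) {
      if (!s.stale()) {
        recovery.add(s); // first pass: skip stale nodes
      }
    }
    if (recovery.size() <= 1) {
      // too many stale replicas: retry with liveness as the only filter
      recovery.clear();
      for (Storage s : storages) {
        if (s.alive()) {
          recovery.add(s);
        }
      }
    }
    return recovery;
  }
}
```

In the t1–t3 scenario above (dn1 stale and dead, dn2 live), this fallback selects only dn2, which is the behavior the patch aims for.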
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828906#comment-17828906 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531803838 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { final int targetLength = strategy.getTargetLength(); -int length = 0; -try { - while (length < targetLength) { -int ret = strategy.readFromBlock(blockReader); -if (ret < 0) { - throw new IOException("Unexpected EOS from the reader"); +int curAttempts = 0; +while (curAttempts < readDNMaxAttempts) { + curAttempts++; + int length = 0; + try { +while (length < targetLength) { + int ret = strategy.readFromBlock(blockReader); + if (ret < 0) { +throw new IOException("Unexpected EOS from the reader"); + } + length += ret; +} +return length; + } catch (ChecksumException ce) { +DFSClient.LOG.warn("Found Checksum error for " ++ currentBlock + " from " + currentNode ++ " at " + ce.getPos()); +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +// we want to remember which block replicas we have tried +corruptedBlocks.addCorruptedBlock(currentBlock.getBlock(), currentNode); +throw ce; + } catch (IOException e) { +//Clear buffer to make next decode success +strategy.getReadBuffer().clear(); +if (curAttempts < readDNMaxAttempts) { + if (readerInfos[chunkIndex].reader != null) { +readerInfos[chunkIndex].reader.close(); + } + if (dfsStripedInputStream.createBlockReader(currentBlock, + alignedStripe.getOffsetInBlock(), 
targetBlocks, Review Comment: If pread is used and the buffer size is set to a full block size, then for a single block on a DN the data of multiple cell units may be read, so the ByteBufferStrategy array in the StripingChunk corresponding to the AlignedStripe is computed to have multiple entries (there are multiple List slices in ChunkByteBuffer); see https://github.com/apache/hadoop/assets/3760130/40f7a944-ea57-4891-9719-86a1b009244d. So when retrying createBlockReader in readToBuffer, we may need to consider the current actual offsetInBlock to avoid reading duplicate data from the datanode. > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { >
[jira] [Created] (HDFS-17434) Selector.select in SocketIOWithTimeout.java has significant overhead
qinyuren created HDFS-17434: --- Summary: Selector.select in SocketIOWithTimeout.java has significant overhead Key: HDFS-17434 URL: https://issues.apache.org/jira/browse/HDFS-17434 Project: Hadoop HDFS Issue Type: Task Reporter: qinyuren Attachments: image-2024-03-20-19-10-13-016.png, image-2024-03-20-19-22-29-829.png, image-2024-03-20-19-24-02-233.png In our cluster, the SendDataPacketBlockedOnNetworkNanosAvgTime metric ranges from 5ms to 10ms, exceeding the usual disk reading overhead. !image-2024-03-20-19-10-13-016.png|width=662,height=135! By adding log printing, it turns out that the Selector.select function has significant overhead. !image-2024-03-20-19-22-29-829.png|width=474,height=262! !image-2024-03-20-19-24-02-233.png|width=445,height=181! I would like to know if this falls within the normal range or how we can improve it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828902#comment-17828902 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531794650

## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:
@@ -284,7 +307,8 @@ private Callable<Void> readCells(final BlockReader reader,
       int ret = 0;
       for (ByteBufferStrategy strategy : strategies) {
-        int bytesReead = readToBuffer(reader, datanode, strategy, currentBlock);
+        int bytesReead = readToBuffer(reader, datanode, strategy, currentBlock,
+            chunkIndex);

Review Comment: For `readToBuffer`, we may need to consider the current actual offsetInBlock: `readToBuffer(reader, datanode, strategy, currentBlock, chunkIndex, ret);`

> DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > Thread.sleep(sleepDuration); > System.out.println("resuming"); > } > if (count == countBeforeSleep + countAfterSleep) { > System.out.println("done"); > break; > } > } > } catch (Exception e) { > System.out.println("exception on read " + count + " read total " > + readTotal); > throw e; > } > } > } > {code} > The issue appears to be due to the fact that datanodes close the connection > of EC client if it doesn't fetch next packet for longer than > dfs.client.socket-timeout. The EC client doesn't retry and instead assumes > that those datanodes went away resulting in "missing blocks" exception. 
> I was able to consistently reproduce with the following arguments: > {noformat} > bufLen = 100 (just below 1MB which is the size of the stripe) > sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000) > countBeforeSleep = 1 > countAfterSleep = 7 > {noformat} > I've attached the entire log output of running the snippet above against > erasure coded file with RS-3-2-1024k policy. And here are the logs from > datanodes of disconnecting the client: > datanode 1: > {noformat} > 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: > /10.128.14.46:9866); java.net.SocketTi
[jira] [Updated] (HDFS-16016) BPServiceActor add a new thread to handle IBR
[ https://issues.apache.org/jira/browse/HDFS-16016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiping Zhang updated HDFS-16016: Attachment: image-2024-03-20-18-31-23-937.png > BPServiceActor add a new thread to handle IBR > - > > Key: HDFS-16016 > URL: https://issues.apache.org/jira/browse/HDFS-16016 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: JiangHua Zhu >Assignee: Viraj Jasani >Priority: Minor > Labels: pull-request-available > Fix For: 3.3.6 > > Attachments: image-2023-11-03-18-11-54-502.png, > image-2023-11-06-10-53-13-584.png, image-2023-11-06-10-55-50-939.png, > image-2024-03-20-18-31-23-937.png > > Time Spent: 5h 20m > Remaining Estimate: 0h > > Now BPServiceActor#offerService() is doing many things, FBR, IBR, heartbeat. > We can handle IBR independently to improve the performance of heartbeat and > FBR. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15413) DFSStripedInputStream throws exception when datanodes close idle connections
[ https://issues.apache.org/jira/browse/HDFS-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828904#comment-17828904 ] ASF GitHub Bot commented on HDFS-15413: --- haiyang1987 commented on code in PR #5829: URL: https://github.com/apache/hadoop/pull/5829#discussion_r1531798611 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java: ## @@ -233,41 +235,62 @@ private ByteBufferStrategy[] getReadStrategies(StripingChunk chunk) { private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, - ExtendedBlock currentBlock) throws IOException { + LocatedBlock currentBlock, int chunkIndex) throws IOException { Review Comment: ``` private int readToBuffer(BlockReader blockReader, DatanodeInfo currentNode, ByteBufferStrategy strategy, LocatedBlock currentBlock, int chunkIndex, long offsetInBlock) ``` > DFSStripedInputStream throws exception when datanodes close idle connections > > > Key: HDFS-15413 > URL: https://issues.apache.org/jira/browse/HDFS-15413 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding, hdfs-client >Affects Versions: 3.1.3 > Environment: - Hadoop 3.1.3 > - erasure coding with ISA-L and RS-3-2-1024k scheme > - running in kubernetes > - dfs.client.socket-timeout = 1 > - dfs.datanode.socket.write.timeout = 1 >Reporter: Andrey Elenskiy >Priority: Critical > Labels: pull-request-available > Attachments: out.log > > > We've run into an issue with compactions failing in HBase when erasure coding > is enabled on a table directory. 
After digging further I was able to narrow > it down to a seek + read logic and able to reproduce the issue with hdfs > client only: > {code:java} > import org.apache.hadoop.conf.Configuration; > import org.apache.hadoop.fs.Path; > import org.apache.hadoop.fs.FileSystem; > import org.apache.hadoop.fs.FSDataInputStream; > public class ReaderRaw { > public static void main(final String[] args) throws Exception { > Path p = new Path(args[0]); > int bufLen = Integer.parseInt(args[1]); > int sleepDuration = Integer.parseInt(args[2]); > int countBeforeSleep = Integer.parseInt(args[3]); > int countAfterSleep = Integer.parseInt(args[4]); > Configuration conf = new Configuration(); > FSDataInputStream istream = FileSystem.get(conf).open(p); > byte[] buf = new byte[bufLen]; > int readTotal = 0; > int count = 0; > try { > while (true) { > istream.seek(readTotal); > int bytesRemaining = bufLen; > int bufOffset = 0; > while (bytesRemaining > 0) { > int nread = istream.read(buf, 0, bufLen); > if (nread < 0) { > throw new Exception("nread is less than zero"); > } > readTotal += nread; > bufOffset += nread; > bytesRemaining -= nread; > } > count++; > if (count == countBeforeSleep) { > System.out.println("sleeping for " + sleepDuration + " > milliseconds"); > Thread.sleep(sleepDuration); > System.out.println("resuming"); > } > if (count == countBeforeSleep + countAfterSleep) { > System.out.println("done"); > break; > } > } > } catch (Exception e) { > System.out.println("exception on read " + count + " read total " > + readTotal); > throw e; > } > } > } > {code} > The issue appears to be due to the fact that datanodes close the connection > of EC client if it doesn't fetch next packet for longer than > dfs.client.socket-timeout. The EC client doesn't retry and instead assumes > that those datanodes went away resulting in "missing blocks" exception. 
> I was able to consistently reproduce with the following arguments: > {noformat} > bufLen = 100 (just below 1MB which is the size of the stripe) > sleepDuration = (dfs.client.socket-timeout + 1) * 1000 (in our case 11000) > countBeforeSleep = 1 > countAfterSleep = 7 > {noformat} > I've attached the entire log output of running the snippet above against > erasure coded file with RS-3-2-1024k policy. And here are the logs from > datanodes of disconnecting the client: > datanode 1: > {noformat} > 2020-06-15 19:06:20,697 INFO datanode.DataNode: Likely the client has stopped > reading, disconnecting it (datanode-v11-0-hadoop.hadoop:9866:DataXceiver > error processing READ_BLOCK operation src: /10.128.23.40:53748 dst: > /
[jira] [Commented] (HDFS-17433) metrics sumOfActorCommandQueueLength should only record valid commands
[ https://issues.apache.org/jira/browse/HDFS-17433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828676#comment-17828676 ] ASF GitHub Bot commented on HDFS-17433: --- hadoop-yetus commented on PR #6644: URL: https://github.com/apache/hadoop/pull/6644#issuecomment-2009083667 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 12m 41s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 44m 39s | | trunk passed | | +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 12s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 27s | | trunk passed | | +1 :green_heart: | javadoc | 1m 6s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 45s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 14s | | trunk passed | | +1 :green_heart: | shadedclient | 35m 10s | | branch has no errors when building and testing our client artifacts. 
| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | | the patch passed | | +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 12s | | the patch passed | | +1 :green_heart: | compile | 1m 5s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 5s | | the patch passed | | -1 :x: | blanks | 0m 0s | [/blanks-eol.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/artifact/out/blanks-eol.txt) | The patch has 1 line(s) that end in blanks. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply | | +1 :green_heart: | checkstyle | 0m 58s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 17s | | the patch passed | | +1 :green_heart: | javadoc | 0m 52s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 36s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 11s | | the patch passed | | +1 :green_heart: | shadedclient | 34m 54s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 229m 1s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. 
| | | | 381m 9s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.44 ServerAPI=1.44 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6644 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux fdc3d112ff7a 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 31038f1ddc0aa1c3f2c7803bcc47b3418d200be7 | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6644/1/testReport/ | | Max. process+thread count | 4051 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop
[jira] [Updated] (HDFS-17423) [FGL] BlockManagerSafeMode supports fine-grained lock
[ https://issues.apache.org/jira/browse/HDFS-17423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-17423: -- Labels: pull-request-available (was: ) > [FGL] BlockManagerSafeMode supports fine-grained lock > - > > Key: HDFS-17423 > URL: https://issues.apache.org/jira/browse/HDFS-17423 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > [FGL] BlockManagerSafeMode supports fine-grained lock -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org