[jira] [Resolved] (HDFS-17503) Unreleased volume references because of OOM

2024-05-09 Thread ZanderXu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZanderXu resolved HDFS-17503.
-
Fix Version/s: 3.5.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> Unreleased volume references because of OOM
> ---
>
> Key: HDFS-17503
> URL: https://issues.apache.org/jira/browse/HDFS-17503
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Zilong Zhu
>Assignee: Zilong Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> When BlockSender throws an Error because of an OOM, the volume reference
> obtained by the thread is not released, which causes the thread trying to
> remove the volume to wait and fall into an infinite loop.
> I found that HDFS-15963 catches the exception and releases the volume
> reference, but it does not handle the case of a thrown Error. I think
> "catch (Throwable t)" should be used instead of "catch (IOException ioe)".
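For illustration, a minimal self-contained sketch of the proposed pattern (the classes below are simplified stand-ins, not the actual BlockSender/FsVolumeReference code):

{code:java}
// Toy model of the fix: widen the catch from IOException to Throwable so
// the volume reference is released even when an Error (e.g. an
// OutOfMemoryError) is thrown, not only on checked I/O failures.
import java.io.Closeable;
import java.io.IOException;

public class VolumeRefSketch {
  /** Stand-in for FsVolumeReference, which must always be released. */
  static class VolumeReference implements Closeable {
    @Override public void close() { System.out.println("volume reference released"); }
  }

  static void sendBlock() throws IOException {
    // Simulate the failure mode from the report: an Error, not an IOException.
    throw new OutOfMemoryError("simulated OOM in BlockSender");
  }

  public static void main(String[] args) throws IOException {
    VolumeReference ref = new VolumeReference();
    try {
      sendBlock();
      ref.close();
    } catch (Throwable t) { // "catch (IOException ioe)" would miss the Error
      ref.close();          // reference is released on any failure
      throw t;              // precise rethrow: IOException or unchecked
    }
  }
}
{code}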



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64

2024-05-09 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/1580/

[May 7, 2024, 5:29:32 AM] (Sammi Chen) Revert "HADOOP-18851: Performance 
improvement for DelegationTokenSecretManager. (#6001). Contributed by Vikas 
Kumar."

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

Apache Hadoop qbt Report: trunk+JDK11 on Linux/x86_64

2024-05-09 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java11-linux-x86_64/673/

[May 7, 2024, 5:29:32 AM] (Sammi Chen) Revert "HADOOP-18851: Performance 
improvement for DelegationTokenSecretManager. (#6001). Contributed by Vikas 
Kumar."

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org

[jira] [Created] (HDFS-17517) [FGL] Abstract lock mode to cover all RPCs

2024-05-09 Thread ZanderXu (Jira)
ZanderXu created HDFS-17517:
---

 Summary: [FGL] Abstract lock mode to cover all RPCs
 Key: HDFS-17517
 URL: https://issues.apache.org/jira/browse/HDFS-17517
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: ZanderXu
Assignee: ZanderXu


There are many RPCs in the NameNode, and different RPCs apply different
processing logic to the input path, such as create, mkdir, and getFileInfo.

Here we should abstract the locking modes used by resolvePath so that they
cover all of these RPCs.
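
For illustration, one possible shape such an abstraction could take (a sketch under assumed, simplified names; not the actual FGL design):

{code:java}
// Illustrative sketch only: an abstract lock mode that each RPC could
// declare, telling resolvePath how to lock the components of the path.
public enum LockMode {
  /** Read lock along the whole path (e.g. getFileInfo). */
  READ,
  /** Write lock on the last component, read locks on ancestors (e.g. create, mkdir). */
  WRITE,
  /** Fall back to a global lock for operations not yet converted. */
  GLOBAL
}

// A hypothetical resolvePath signature parameterized by the mode:
//   INodesInPath resolvePath(String src, LockMode mode) throws IOException;
{code}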



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



Apache Hadoop qbt Report: branch-2.10+JDK7 on Linux/x86_64

2024-05-09 Thread Apache Jenkins Server
For more details, see 
https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/

No changes




-1 overall


The following subsystems voted -1:
asflicense hadolint mvnsite pathlen unit


The following subsystems voted -1 but
were configured to be filtered/ignored:
cc checkstyle javac javadoc pylint shellcheck whitespace


The following subsystems are considered long running:
(runtime bigger than 1h  0m  0s)
unit


Specific tests:

Failed junit tests:

   hadoop.fs.TestFileUtil
   hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
   hadoop.hdfs.TestLeaseRecovery2
   hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain
   hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion
   hadoop.hdfs.TestFileLengthOnClusterRestart
   hadoop.hdfs.TestDFSInotifyEventInputStream
   hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
   hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
   hadoop.fs.viewfs.TestViewFileSystemHdfs
   hadoop.hdfs.server.federation.router.TestRouterQuota
   hadoop.hdfs.server.federation.router.TestRouterNamenodeHeartbeat
   hadoop.hdfs.server.federation.resolver.order.TestLocalResolver
   hadoop.hdfs.server.federation.resolver.TestMultipleDestinationResolver
   hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
   hadoop.mapreduce.lib.input.TestLineRecordReader
   hadoop.mapred.TestLineRecordReader
   hadoop.mapreduce.jobhistory.TestHistoryViewerPrinter
   hadoop.resourceestimator.service.TestResourceEstimatorService
   hadoop.resourceestimator.solver.impl.TestLpSolver
   hadoop.yarn.sls.TestSLSRunner
   hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestNumaResourceAllocator
   hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestNumaResourceHandlerImpl
   hadoop.yarn.server.resourcemanager.TestClientRMService
   hadoop.yarn.server.resourcemanager.monitor.invariants.TestMetricsInvariantChecker

   cc:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-compile-cc-root.txt [4.0K]

   javac:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-compile-javac-root.txt [488K]

   checkstyle:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-checkstyle-root.txt [14M]

   hadolint:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-hadolint.txt [4.0K]

   mvnsite:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-mvnsite-root.txt [572K]

   pathlen:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/pathlen.txt [12K]

   pylint:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-pylint.txt [20K]

   shellcheck:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-shellcheck.txt [72K]

   whitespace:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/whitespace-eol.txt [12M]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/whitespace-tabs.txt [1.3M]

   javadoc:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-javadoc-root.txt [36K]

   unit:
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt [220K]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt [1.8M]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt [36K]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs_src_contrib_bkjournal.txt [16K]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt [104K]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-tools_hadoop-azure.txt [20K]
   https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-tools_hadoop-resourceestimator.txt [16K]

[jira] [Created] (HDFS-17516) Erasure Coding: Some reconstruction blocks and metrics are inaccurate when decommissioning a DN which contains many EC blocks.

2024-05-09 Thread Chenyu Zheng (Jira)
Chenyu Zheng created HDFS-17516:
---

 Summary: Erasure Coding: Some reconstruction blocks and metrics 
are inaccurate when decommissioning a DN which contains many EC blocks.
 Key: HDFS-17516
 URL: https://issues.apache.org/jira/browse/HDFS-17516
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng
 Attachments: 截屏2024-05-09 下午3.59.22.png, 截屏2024-05-09 下午3.59.44.png

When decommissioning a DN which contains many EC blocks, the DN will be marked
as busy by scheduleReconstruction, and ErasureCodingWork::addTaskToDatanode
will then not add any block to ecBlocksToBeReplicated.
Although no DNA_TRANSFER BlockCommand will be generated for such a block,
pendingReconstruction and neededReconstruction are still updated, and the
BlockManager mistakenly believes that the block is being copied.
The periodic increases of the metrics
`fs_namesystem_num_timed_out_pending_reconstructions` and
`fs_namesystem_under_replicated_blocks` also confirm this. In fact, many blocks
are never actually copied; they are only re-added to neededReconstruction after
they time out.
!截屏2024-05-09 下午3.59.44.png!!截屏2024-05-09 下午3.59.22.png!
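
For illustration, a self-contained toy model of the bookkeeping mismatch described above (all names are simplified stand-ins for the BlockManager internals, not the real Hadoop code):

{code:java}
// Toy model of the reported bug: the scheduling bookkeeping runs even
// when the busy DN never receives the reconstruction task.
import java.util.HashSet;
import java.util.Set;

public class EcBusyDnSketch {
  static Set<String> neededReconstruction = new HashSet<>();
  static Set<String> pendingReconstruction = new HashSet<>();
  static Set<String> ecBlocksToBeReplicated = new HashSet<>();

  static void scheduleEcReconstruction(String block, boolean dnIsBusy) {
    if (!dnIsBusy) {
      // Only a non-busy DN actually gets the task (DNA_TRANSFER command).
      ecBlocksToBeReplicated.add(block);
    }
    // Bug: the bookkeeping below runs either way, so the NameNode
    // believes the block is being copied even when no task was queued.
    neededReconstruction.remove(block);
    pendingReconstruction.add(block);
  }

  public static void main(String[] args) {
    neededReconstruction.add("blk_1");
    scheduleEcReconstruction("blk_1", true); // decommissioning, busy DN
    System.out.println("queued to DN: " + ecBlocksToBeReplicated); // []
    System.out.println("pending copy: " + pendingReconstruction);  // [blk_1]
    // blk_1 now sits in pendingReconstruction until it times out and is
    // re-added to neededReconstruction, inflating the timed-out metric.
  }
}
{code}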
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-17515) Erasure Coding: ErasureCodingWork is not effectively limited during a block reconstruction cycle.

2024-05-09 Thread Chenyu Zheng (Jira)
Chenyu Zheng created HDFS-17515:
---

 Summary: Erasure Coding: ErasureCodingWork is not effectively 
limited during a block reconstruction cycle.
 Key: HDFS-17515
 URL: https://issues.apache.org/jira/browse/HDFS-17515
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Chenyu Zheng
Assignee: Chenyu Zheng


In a block reconstruction cycle, ErasureCodingWork is not effectively limited.
I added some debug logging that fires whenever the size of
ecBlocksToBeReplicated is an integer multiple of 100.

 
{code:java}
2024-05-09 10:46:06,986 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 100 blocks
2024-05-09 10:46:06,987 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 200 blocks
...
2024-05-09 10:46:06,992 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 2000 blocks
2024-05-09 10:46:06,992 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 2100 blocks
{code}
 

During a block reconstruction cycle, ecBlocksToBeReplicated increases from 0 to
2100, which is much larger than replicationStreamsHardLimit. This brings
unfairness and leads to a greater tendency to copy EC blocks.

In fact, for non-EC blocks this is not a problem:
pendingReplicationWithoutTargets is incremented when work is scheduled, and
when it grows too large, no further work is scheduled for that node.
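
For illustration, a minimal sketch of the kind of per-DN gate the report suggests is missing, mirroring the non-EC path (all names are simplified assumptions, not the actual patch):

{code:java}
// Illustrative sketch only: cap EC reconstruction work per DN the way
// pendingReplicationWithoutTargets gates non-EC scheduling, instead of
// letting ecBlocksToBeReplicated grow unbounded within one cycle.
public class EcWorkLimitSketch {
  static final int REPLICATION_STREAMS_HARD_LIMIT = 4; // assumed value

  /** Stop scheduling once the DN already has hard-limit blocks queued. */
  static boolean mayScheduleEcWork(int ecBlocksQueuedOnDn) {
    return ecBlocksQueuedOnDn < REPLICATION_STREAMS_HARD_LIMIT;
  }

  public static void main(String[] args) {
    for (int queued = 0; queued <= 5; queued++) {
      System.out.println(queued + " queued -> schedule more? "
          + mayScheduleEcWork(queued));
    }
  }
}
{code}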

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org