[jira] [Resolved] (HDFS-17503) Unreleased volume references because of OOM
[ https://issues.apache.org/jira/browse/HDFS-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ZanderXu resolved HDFS-17503.
-----------------------------
    Fix Version/s: 3.5.0
     Hadoop Flags: Reviewed
       Resolution: Fixed

> Unreleased volume references because of OOM
> -------------------------------------------
>
>                 Key: HDFS-17503
>                 URL: https://issues.apache.org/jira/browse/HDFS-17503
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Zilong Zhu
>            Assignee: Zilong Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.0
>
> When BlockSender throws an error because of OOM, the volume reference
> obtained by the thread is not released, which causes the thread trying to
> remove the volume to wait and fall into an infinite loop.
> I found that HDFS-15963 catches the exception and releases the volume
> reference, but it does not handle thrown Errors. I think "catch (Throwable t)"
> should be used instead of "catch (IOException ioe)".

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
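The shape of the fix the issue proposes can be sketched as follows. This is a minimal illustration, not the actual BlockSender code: the class and method names (`VolumeReference`, `sendBlock`) are hypothetical stand-ins. The point is that `catch (Throwable t)` releases the reference even when an Error such as OutOfMemoryError is thrown, whereas `catch (IOException ioe)` would let the Error escape with the reference still held, blocking any thread waiting to remove the volume.

```java
import java.io.Closeable;

public class VolumeRefSketch {
    /** Hypothetical stand-in for a ref-counted volume handle. */
    static class VolumeReference implements Closeable {
        boolean released = false;
        @Override public void close() { released = true; }
    }

    /**
     * With "catch (IOException ioe)" only, an OutOfMemoryError would
     * propagate past the release call; widening to Throwable guarantees
     * the reference is dropped before the error continues upward.
     */
    static void sendBlock(VolumeReference ref, Runnable body) {
        try {
            body.run();
        } catch (Throwable t) {   // widened from IOException, per the issue
            ref.close();          // release the volume reference first
            throw t;              // precise rethrow (Java 7+), then propagate
        }
        ref.close();              // normal-path release
    }

    public static void main(String[] args) {
        VolumeReference ref = new VolumeReference();
        try {
            sendBlock(ref, () -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError expected) {
            // the Error still propagates; the reference must already be released
        }
        System.out.println("released=" + ref.released);
    }
}
```

A try/finally would achieve the same guarantee; the issue frames it as widening the catch because the existing code already handles the exception path there.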
Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/1580/

[May 7, 2024, 5:29:32 AM] (Sammi Chen) Revert "HADOOP-18851: Performance improvement for DelegationTokenSecretManager. (#6001). Contributed by Vikas Kumar."
Apache Hadoop qbt Report: trunk+JDK11 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java11-linux-x86_64/673/

[May 7, 2024, 5:29:32 AM] (Sammi Chen) Revert "HADOOP-18851: Performance improvement for DelegationTokenSecretManager. (#6001). Contributed by Vikas Kumar."
[jira] [Created] (HDFS-17517) [FGL] Abstract lock mode to cover all RPCs
ZanderXu created HDFS-17517:
-------------------------------

             Summary: [FGL] Abstract lock mode to cover all RPCs
                 Key: HDFS-17517
                 URL: https://issues.apache.org/jira/browse/HDFS-17517
             Project: Hadoop HDFS
          Issue Type: Sub-task
            Reporter: ZanderXu
            Assignee: ZanderXu

There are many RPCs in the NameNode. Different RPCs apply different processing logic to the input path, such as create, mkdir, and getFileInfo. We should abstract the locking modes used by resolvePath so that they cover all of these RPCs.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
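One possible shape for such an abstraction is sketched below. This is purely illustrative of the idea in the sub-task description, not the actual HDFS-17517 design: the names (`RpcLockMode`, `requiredMode`) and the mode assignments are assumptions chosen to show how per-RPC lock requirements could be declared in one place.

```java
public class LockModeSketch {
    /** Hypothetical fine-grained-lock modes an RPC could declare up front. */
    enum RpcLockMode {
        GLOBAL_READ,   // read-only namespace access, e.g. getFileInfo
        GLOBAL_WRITE,  // operations touching global namespace state
        PATH_WRITE     // write lock scoped to the resolved path, e.g. create/mkdir
    }

    /** Map an RPC name to the mode it needs; assignments are illustrative. */
    static RpcLockMode requiredMode(String rpc) {
        switch (rpc) {
            case "getFileInfo": return RpcLockMode.GLOBAL_READ;
            case "create":
            case "mkdir":       return RpcLockMode.PATH_WRITE;
            default:            return RpcLockMode.GLOBAL_WRITE;
        }
    }

    public static void main(String[] args) {
        // resolvePath-style code could branch on the declared mode instead of
        // hard-coding lock handling per RPC.
        System.out.println("create      -> " + requiredMode("create"));
        System.out.println("getFileInfo -> " + requiredMode("getFileInfo"));
    }
}
```

The benefit of centralizing the mapping is that resolvePath can acquire the right lock generically, rather than each RPC handler re-implementing its own locking.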
Apache Hadoop qbt Report: branch-2.10+JDK7 on Linux/x86_64
For more details, see https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/

No changes

-1 overall

The following subsystems voted -1:
    asflicense hadolint mvnsite pathlen unit

The following subsystems voted -1 but were configured to be filtered/ignored:
    cc checkstyle javac javadoc pylint shellcheck whitespace

The following subsystems are considered long running:
(runtime bigger than 1h 0m 0s)
    unit

Specific tests:

    Failed junit tests :

       hadoop.fs.TestFileUtil
       hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
       hadoop.hdfs.TestLeaseRecovery2
       hadoop.hdfs.server.blockmanagement.TestReplicationPolicyWithUpgradeDomain
       hadoop.hdfs.server.namenode.snapshot.TestSnapshotDeletion
       hadoop.hdfs.TestFileLengthOnClusterRestart
       hadoop.hdfs.TestDFSInotifyEventInputStream
       hadoop.hdfs.server.namenode.snapshot.TestSnapshotBlocksMap
       hadoop.hdfs.qjournal.server.TestJournalNodeRespectsBindHostKeys
       hadoop.fs.viewfs.TestViewFileSystemHdfs
       hadoop.hdfs.server.federation.router.TestRouterQuota
       hadoop.hdfs.server.federation.router.TestRouterNamenodeHeartbeat
       hadoop.hdfs.server.federation.resolver.order.TestLocalResolver
       hadoop.hdfs.server.federation.resolver.TestMultipleDestinationResolver
       hadoop.contrib.bkjournal.TestBookKeeperHACheckpoints
       hadoop.mapreduce.lib.input.TestLineRecordReader
       hadoop.mapred.TestLineRecordReader
       hadoop.mapreduce.jobhistory.TestHistoryViewerPrinter
       hadoop.resourceestimator.service.TestResourceEstimatorService
       hadoop.resourceestimator.solver.impl.TestLpSolver
       hadoop.yarn.sls.TestSLSRunner
       hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestNumaResourceAllocator
       hadoop.yarn.server.nodemanager.containermanager.linux.resources.TestNumaResourceHandlerImpl
       hadoop.yarn.server.resourcemanager.TestClientRMService
       hadoop.yarn.server.resourcemanager.monitor.invariants.TestMetricsInvariantChecker

   cc:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-compile-cc-root.txt [4.0K]

   javac:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-compile-javac-root.txt [488K]

   checkstyle:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-checkstyle-root.txt [14M]

   hadolint:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-hadolint.txt [4.0K]

   mvnsite:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-mvnsite-root.txt [572K]

   pathlen:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/pathlen.txt [12K]

   pylint:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-pylint.txt [20K]

   shellcheck:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/diff-patch-shellcheck.txt [72K]

   whitespace:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/whitespace-eol.txt [12M]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/whitespace-tabs.txt [1.3M]

   javadoc:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-javadoc-root.txt [36K]

   unit:
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-common-project_hadoop-common.txt [220K]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt [1.8M]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt [36K]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs_src_contrib_bkjournal.txt [16K]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt [104K]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-tools_hadoop-azure.txt [20K]
      https://ci-hadoop.apache.org/job/hadoop-qbt-branch-2.10-java7-linux-x86_64/1387/artifact/out/patch-unit-hadoop-tools_hadoop-resourceestimator.txt [16K]
[jira] [Created] (HDFS-17516) Erasure Coding: Some reconstruction blocks and metrics are inaccurate when decommissioning a DN which contains many EC blocks.
Chenyu Zheng created HDFS-17516:
-----------------------------------

             Summary: Erasure Coding: Some reconstruction blocks and metrics are inaccurate when decommissioning a DN which contains many EC blocks
                 Key: HDFS-17516
                 URL: https://issues.apache.org/jira/browse/HDFS-17516
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Chenyu Zheng
            Assignee: Chenyu Zheng
         Attachments: 截屏2024-05-09 下午3.59.22.png, 截屏2024-05-09 下午3.59.44.png

When decommissioning a DN which contains many EC blocks, the DN is marked as busy by scheduleReconstruction, so ErasureCodingWork::addTaskToDatanode does not add any block to ecBlocksToBeReplicated. Although no DNA_TRANSFER BlockCommand is generated for such a block, pendingReconstruction and neededReconstruction are still updated, and the BlockManager mistakenly believes the block is being copied. The periodic increases of the metrics `fs_namesystem_num_timed_out_pending_reconstructions` and `fs_namesystem_under_replicated_blocks` also confirm this. In fact, many blocks are never actually copied; they are only re-added to neededReconstruction after they time out.

!截屏2024-05-09 下午3.59.44.png!!截屏2024-05-09 下午3.59.22.png!
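The accounting mismatch described above can be sketched in miniature. This is a hedged illustration, not the real BlockManager internals: the class `EcPendingSketch`, its fields, and the `schedule`/`expire` methods are hypothetical simplifications. It shows why a block scheduled against a busy (decommissioning) DN can only leave pendingReconstruction by timing out, inflating the timeout metric.

```java
import java.util.HashSet;
import java.util.Set;

public class EcPendingSketch {
    final Set<String> pendingReconstruction = new HashSet<>();
    final Set<String> neededReconstruction = new HashSet<>();
    int timedOutCount = 0; // mirrors the rising timeout metric in the report

    /** Schedule one EC block; returns true only if a transfer task was generated. */
    boolean schedule(String block, boolean sourceBusy) {
        neededReconstruction.remove(block);
        // Bug shape from the issue: even when the busy DN yields no
        // DNA_TRANSFER command, the block is still tracked as pending.
        pendingReconstruction.add(block);
        return !sourceBusy;
    }

    /** Timeout scan: a pending block with no real task goes back to needed. */
    void expire(String block) {
        if (pendingReconstruction.remove(block)) {
            neededReconstruction.add(block);
            timedOutCount++;
        }
    }

    public static void main(String[] args) {
        EcPendingSketch bm = new EcPendingSketch();
        bm.neededReconstruction.add("blk_ec_1");
        boolean generated = bm.schedule("blk_ec_1", true); // decommission-busy DN
        bm.expire("blk_ec_1");                             // no copy ever happened
        System.out.println("generated=" + generated + " timedOut=" + bm.timedOutCount);
    }
}
```

The fix direction the issue implies is the inverse guard: when no task is generated, do not move the block into pendingReconstruction at all, so it stays eligible for immediate rescheduling.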
[jira] [Created] (HDFS-17515) Erasure Coding: ErasureCodingWork is not effectively limited during a block reconstruction cycle.
Chenyu Zheng created HDFS-17515:
-----------------------------------

             Summary: Erasure Coding: ErasureCodingWork is not effectively limited during a block reconstruction cycle
                 Key: HDFS-17515
                 URL: https://issues.apache.org/jira/browse/HDFS-17515
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Chenyu Zheng
            Assignee: Chenyu Zheng

During a block reconstruction cycle, ErasureCodingWork is not effectively limited. I added a debug log that fires whenever ecBlocksToBeReplicated reaches an integer multiple of 100:

{code:java}
2024-05-09 10:46:06,986 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 100 blocks
2024-05-09 10:46:06,987 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 200 blocks
...
2024-05-09 10:46:06,992 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 2000 blocks
2024-05-09 10:46:06,992 DEBUG org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManagerZCY: ecBlocksToBeReplicated for IP:PORT already have 2100 blocks
{code}

Within a single cycle, ecBlocksToBeReplicated grows from 0 to 2100, which is much larger than replicationStreamsHardLimit. This is unfair and biases the scheduler toward copying EC blocks. Non-EC blocks do not have this problem: pendingReplicationWithoutTargets increases when work is scheduled, and once it grows too large, no further work is scheduled for that node.
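The missing per-node cap can be sketched as below. This is an illustration under assumptions, not the real DatanodeDescriptor API: the class `EcLimitSketch`, the constant value, and both `add` methods are hypothetical. It contrasts the unlimited accumulation the debug log exposed with a capped variant analogous to how pendingReplicationWithoutTargets throttles non-EC work.

```java
import java.util.ArrayList;
import java.util.List;

public class EcLimitSketch {
    // Illustrative value; the real replicationStreamsHardLimit is configurable.
    static final int REPLICATION_STREAMS_HARD_LIMIT = 4;

    final List<String> ecBlocksToBeReplicated = new ArrayList<>();

    /** Unlimited version: mirrors the behavior the report observed (0 -> 2100). */
    void addUnlimited(String block) {
        ecBlocksToBeReplicated.add(block);
    }

    /** Capped version: refuse new work once the node hits the hard limit. */
    boolean addCapped(String block) {
        if (ecBlocksToBeReplicated.size() >= REPLICATION_STREAMS_HARD_LIMIT) {
            return false; // leave the block for another node or a later cycle
        }
        ecBlocksToBeReplicated.add(block);
        return true;
    }

    public static void main(String[] args) {
        EcLimitSketch node = new EcLimitSketch();
        int accepted = 0;
        for (int i = 0; i < 2100; i++) {
            if (node.addCapped("blk_" + i)) accepted++;
        }
        // Per-node queue growth is now bounded by the hard limit.
        System.out.println("accepted=" + accepted);
    }
}
```

With such a cap, one reconstruction cycle cannot load a single node with orders of magnitude more EC work than the limit intends, restoring fairness between EC and non-EC replication.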