[jira] [Updated] (HDFS-17228) Add documentation related to BlockManager
[ https://issues.apache.org/jira/browse/HDFS-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-17228: Component/s: documentation > Add documentation related to BlockManager > - > > Key: HDFS-17228 > URL: https://issues.apache.org/jira/browse/HDFS-17228 > Project: Hadoop HDFS > Issue Type: Improvement > Components: block placement, documentation >Affects Versions: 3.3.3, 3.3.6 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Attachments: image-2023-10-17-17-25-27-363.png > > > In the BlockManager file, some important comments are missing. > Happens here: > !image-2023-10-17-17-25-27-363.png! > If it is improved, the robustness of the distributed system can be increased. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17228) Add documentation related to BlockManager
JiangHua Zhu created HDFS-17228: --- Summary: Add documentation related to BlockManager Key: HDFS-17228 URL: https://issues.apache.org/jira/browse/HDFS-17228 Project: Hadoop HDFS Issue Type: Improvement Components: block placement Affects Versions: 3.3.6, 3.3.3 Reporter: JiangHua Zhu Attachments: image-2023-10-17-17-25-27-363.png In the BlockManager file, some important comments are missing. Happens here: !image-2023-10-17-17-25-27-363.png! If it is improved, the robustness of the distributed system can be increased. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-17228) Add documentation related to BlockManager
[ https://issues.apache.org/jira/browse/HDFS-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-17228: --- Assignee: JiangHua Zhu > Add documentation related to BlockManager > - > > Key: HDFS-17228 > URL: https://issues.apache.org/jira/browse/HDFS-17228 > Project: Hadoop HDFS > Issue Type: Improvement > Components: block placement >Affects Versions: 3.3.3, 3.3.6 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Attachments: image-2023-10-17-17-25-27-363.png > > > In the BlockManager file, some important comments are missing. > Happens here: > !image-2023-10-17-17-25-27-363.png! > If it is improved, the robustness of the distributed system can be increased. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
[ https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-17012: Attachment: screenshot-1.png > Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT > > > Key: HDFS-17012 > URL: https://issues.apache.org/jira/browse/HDFS-17012 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs >Affects Versions: 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Attachments: screenshot-1.png > > > In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have > been used anywhere, this is a redundant option and we should remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
[ https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-17012: Description: In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have been used anywhere, this is a redundant option and we should remove it. !screenshot-1.png! was:In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have been used anywhere, this is a redundant option and we should remove it. > Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT > > > Key: HDFS-17012 > URL: https://issues.apache.org/jira/browse/HDFS-17012 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs >Affects Versions: 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Attachments: screenshot-1.png > > > In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have > been used anywhere, this is a redundant option and we should remove it. > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
JiangHua Zhu created HDFS-17012: --- Summary: Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT Key: HDFS-17012 URL: https://issues.apache.org/jira/browse/HDFS-17012 Project: Hadoop HDFS Issue Type: Improvement Components: datanode, hdfs Affects Versions: 3.3.4 Reporter: JiangHua Zhu In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have been used anywhere, this is a redundant option and we should remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
[ https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-17012: --- Assignee: JiangHua Zhu > Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT > > > Key: HDFS-17012 > URL: https://issues.apache.org/jira/browse/HDFS-17012 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode, hdfs >Affects Versions: 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT doesn't seem to have > been used anywhere, this is a redundant option and we should remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16863) Optimize frequency of regular block reports
[ https://issues.apache.org/jira/browse/HDFS-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655400#comment-17655400 ] JiangHua Zhu commented on HDFS-16863: - [~yuyanlei], if FBRs are reduced, could that have any new impact? 1. Replicas may exist on DataNodes whose state the NameNode should be notified of, but is not notified of in time. 2. The replica information held by the NameNode may become incomplete. > Optimize frequency of regular block reports > --- > > Key: HDFS-16863 > URL: https://issues.apache.org/jira/browse/HDFS-16863 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Reporter: Yanlei Yu >Priority: Major > Attachments: HDFS-16863.patch > > > like HDFS-15162 > Avoid sending block report at regular interval, if there is no failover, > DiskError or any exception encountered in connecting to the Namenode. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface
[ https://issues.apache.org/jira/browse/HDFS-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16807: Affects Version/s: 2.9.2 > Improve legacy ClientProtocol#rename2() interface > - > > Key: HDFS-16807 > URL: https://issues.apache.org/jira/browse/HDFS-16807 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient >Affects Versions: 2.9.2, 3.3.3 >Reporter: JiangHua Zhu >Priority: Major > > In HDFS-2298, rename2() replaced rename(), which is a very meaningful > improvement. It looks like some old customs are still preserved, they are: > 1. When using the shell to execute the mv command, rename() is still used. > ./bin/hdfs dfs -mv [source] [target] > {code:java} > In MoveCommands#Rename: > protected void processPath(PathData src, PathData target) throws > IOException { > .. > if (!target.fs.rename(src.path, target.path)) { > // we have no way to know the actual error... > throw new PathIOException(src.toString()); > } > } > {code} > 2. When NNThroughputBenchmark verifies the rename. > In NNThroughputBenchmark#RenameFileStats: > {code:java} > long executeOp(int daemonId, int inputIdx, String ignore) > throws IOException { > long start = Time.now(); > clientProto.rename(fileNames[daemonId][inputIdx], > destNames[daemonId][inputIdx]); > long end = Time.now(); > return end-start; > } > {code} > I think the interface should be kept uniform since rename() is deprecated. > For NNThroughputBenchmark, it's easy. But it is not easy to improve > MoveCommands, because it involves the transformation of FileSystem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface
[ https://issues.apache.org/jira/browse/HDFS-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620206#comment-17620206 ] JiangHua Zhu commented on HDFS-16807: - Can you guys post some suggestions? [~weichiu] [~aajisaka] [~hexiaoqiao] [~steve_l] [~ayushtkn]. Any suggestion is fine. > Improve legacy ClientProtocol#rename2() interface > - > > Key: HDFS-16807 > URL: https://issues.apache.org/jira/browse/HDFS-16807 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsclient >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Priority: Major > > In HDFS-2298, rename2() replaced rename(), which is a very meaningful > improvement. It looks like some old customs are still preserved, they are: > 1. When using the shell to execute the mv command, rename() is still used. > ./bin/hdfs dfs -mv [source] [target] > {code:java} > In MoveCommands#Rename: > protected void processPath(PathData src, PathData target) throws > IOException { > .. > if (!target.fs.rename(src.path, target.path)) { > // we have no way to know the actual error... > throw new PathIOException(src.toString()); > } > } > {code} > 2. When NNThroughputBenchmark verifies the rename. > In NNThroughputBenchmark#RenameFileStats: > {code:java} > long executeOp(int daemonId, int inputIdx, String ignore) > throws IOException { > long start = Time.now(); > clientProto.rename(fileNames[daemonId][inputIdx], > destNames[daemonId][inputIdx]); > long end = Time.now(); > return end-start; > } > {code} > I think the interface should be kept uniform since rename() is deprecated. > For NNThroughputBenchmark, it's easy. But it is not easy to improve > MoveCommands, because it involves the transformation of FileSystem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface
JiangHua Zhu created HDFS-16807: --- Summary: Improve legacy ClientProtocol#rename2() interface Key: HDFS-16807 URL: https://issues.apache.org/jira/browse/HDFS-16807 Project: Hadoop HDFS Issue Type: Improvement Components: dfsclient Affects Versions: 3.3.3 Reporter: JiangHua Zhu In HDFS-2298, rename2() replaced rename(), which is a very meaningful improvement. It looks like some legacy usages are still preserved: 1. When using the shell to execute the mv command, rename() is still used. ./bin/hdfs dfs -mv [source] [target] {code:java} In MoveCommands#Rename: protected void processPath(PathData src, PathData target) throws IOException { .. if (!target.fs.rename(src.path, target.path)) { // we have no way to know the actual error... throw new PathIOException(src.toString()); } } {code} 2. When NNThroughputBenchmark verifies the rename. In NNThroughputBenchmark#RenameFileStats: {code:java} long executeOp(int daemonId, int inputIdx, String ignore) throws IOException { long start = Time.now(); clientProto.rename(fileNames[daemonId][inputIdx], destNames[daemonId][inputIdx]); long end = Time.now(); return end-start; } {code} I think the interface should be kept uniform since rename() is deprecated. For NNThroughputBenchmark, it's easy. But it is not easy to improve MoveCommands, because it involves changes to FileSystem. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
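The NNThroughputBenchmark change described above (timing rename2() instead of the deprecated rename()) can be sketched as follows. This is an illustrative, self-contained sketch: ClientProtocol and Options.Rename are stubbed with minimal stand-ins here rather than the real Hadoop types in org.apache.hadoop.hdfs.protocol and org.apache.hadoop.fs.Options, and the real method signatures also declare thrown IOExceptions that are omitted for brevity.

```java
import java.util.concurrent.TimeUnit;

// Self-contained sketch of the suggested NNThroughputBenchmark change:
// time rename2() instead of the deprecated rename(). The interface below is
// a stand-in for the real ClientProtocol, trimmed to what the sketch needs.
public class Rename2BenchmarkSketch {
  enum Rename { NONE, OVERWRITE }

  interface ClientProtocol {
    void rename2(String src, String dst, Rename... options);
  }

  // Mirrors RenameFileStats#executeOp, but issues rename2().
  static long executeOp(ClientProtocol clientProto, String src, String dst) {
    long start = System.nanoTime();
    clientProto.rename2(src, dst);  // rename2() rather than rename()
    long end = System.nanoTime();
    return TimeUnit.NANOSECONDS.toMillis(end - start);
  }

  public static void main(String[] args) {
    // A no-op "namenode" stub stands in for the real RPC proxy.
    long elapsedMs = executeOp((src, dst, opts) -> { }, "/src", "/dst");
    System.out.println("rename2 took " + elapsedMs + " ms");
  }
}
```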
[jira] [Commented] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}
[ https://issues.apache.org/jira/browse/HDFS-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619276#comment-17619276 ] JiangHua Zhu commented on HDFS-14750: - Thanks [~xuzq_zander] for the work. I have read your design and have some doubts: 1. Will the penalty time incurred while the Router runs really be reduced that much? 2. Will it affect existing features such as Quota? 3. Overall, RBF still has a lot of room for development, and some compatibility needs to be considered. I have some ideas of my own that I could add, if I may: 1. Give each Router additional functions, including: 1.1. Collecting its own processing performance metrics, in the style of a sliding window. 1.2. Dynamically setting the maximum allowed processing limit according to the sliding-window values. 2. Isolating handlers for misbehaving namespaces. With these measures, the current problems can be effectively alleviated while good compatibility is maintained. > RBF: Improved isolation for downstream name nodes. {Dynamic} > > > Key: HDFS-14750 > URL: https://issues.apache.org/jira/browse/HDFS-14750 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: CR Hota >Assignee: CR Hota >Priority: Major > Labels: pull-request-available > Time Spent: 4h > Remaining Estimate: 0h > > This Jira tracks the work around dynamic allocation of resources in routers > for downstream hdfs clusters. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
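Points 1.1 and 1.2 in the comment above (per-Router metrics kept in a sliding window that drive a dynamic processing cap) can be sketched roughly as below. Everything here is a hypothetical illustration, not RBF code: the class and method names are invented, and the 120%-of-recent-average limit is just one possible policy.

```java
import java.util.ArrayDeque;

// Hypothetical sketch: a Router-side sliding window of per-interval processed
// call counts, from which a dynamic admission limit is derived.
public class RouterWindowSketch {
  private final int windowSize;                  // number of intervals retained
  private final ArrayDeque<Long> counts = new ArrayDeque<>();

  public RouterWindowSketch(int windowSize) {
    this.windowSize = windowSize;
  }

  // Record how many calls were processed in the latest interval.
  public void record(long processed) {
    counts.addLast(processed);
    if (counts.size() > windowSize) {
      counts.removeFirst();                      // slide the window forward
    }
  }

  // One possible policy: allow up to 120% of the recent average throughput.
  public long currentLimit() {
    if (counts.isEmpty()) {
      return Long.MAX_VALUE;                     // no data yet: no cap
    }
    long sum = 0;
    for (long c : counts) {
      sum += c;
    }
    return (sum / counts.size()) * 12 / 10;      // integer 120% of the average
  }
}
```

With a window of 3 intervals, after recording 100, 100, 100, 40 the window holds the last three values (100, 100, 40), so the limit is 80 * 12 / 10 = 96.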
[jira] [Updated] (HDFS-16803) Improve some annotations in hdfs module
[ https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16803: Description: In hdfs module, some annotations are out of date. E.g: {code:java} FSDirRenameOp: /** * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath, * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} */ static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, String src, String dst, BlocksMapUpdateInfo collectedBlocks, boolean logRetryCache,Options.Rename... options) throws IOException { {code} We should try to improve these annotations to make the documentation look more comfortable. was: In FSDirRenameOp, some annotations are out of date. E.g: {code:java} /** * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath, * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} */ static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, String src, String dst, BlocksMapUpdateInfo collectedBlocks, boolean logRetryCache,Options.Rename... options) throws IOException { {code} We should try to improve these annotations to make the documentation look more comfortable. > Improve some annotations in hdfs module > --- > > Key: HDFS-16803 > URL: https://issues.apache.org/jira/browse/HDFS-16803 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation, namenode >Affects Versions: 2.9.2, 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > > In hdfs module, some annotations are out of date. E.g: > {code:java} > FSDirRenameOp: > /** >* @see {@link #unprotectedRenameTo(FSDirectory, String, String, > INodesInPath, >* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} >*/ > static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, > String src, String dst, BlocksMapUpdateInfo collectedBlocks, > boolean logRetryCache,Options.Rename... options) > throws IOException { > {code} > We should try to improve these annotations to make the documentation look > more comfortable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16805) Support set the number of RPC Readers according to different RPC Servers in NameNode
[ https://issues.apache.org/jira/browse/HDFS-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618653#comment-17618653 ] JiangHua Zhu commented on HDFS-16805: - [~haiyang Hu], here is a similar Jira: HDFS-16107 > Support set the number of RPC Readers according to different RPC Servers in > NameNode > > > Key: HDFS-16805 > URL: https://issues.apache.org/jira/browse/HDFS-16805 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Haiyang Hu >Assignee: Haiyang Hu >Priority: Major > > Currently, multiple RPC servers are started in the NameNode, such as the > client RPC server, service RPC server, and lifeline RPC server, and each of > them uses the same parameter 'ipc.server.read.threadpool.size' to set its > number of reader threads. > Consider setting the number of reader threads for each RPC server according > to that server's requirements, for example: > in the client RPC server, use parameter 'dfs.namenode.reader.count'; > in the service RPC server, use parameter 'dfs.namenode.service.reader.count'; > in the lifeline RPC server, use parameter 'dfs.namenode.lifeline.reader.count'. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
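Expressed as configuration, the proposal would look roughly like the following hdfs-site.xml fragment. The property names come straight from the issue description above; the values are arbitrary examples, and whether each key falls back to 'ipc.server.read.threadpool.size' when unset is left open here.

```xml
<!-- Sketch only: per-RPC-server reader counts as proposed in HDFS-16805 -->
<property>
  <name>dfs.namenode.reader.count</name>
  <value>8</value>   <!-- reader threads for the client RPC server -->
</property>
<property>
  <name>dfs.namenode.service.reader.count</name>
  <value>4</value>   <!-- reader threads for the service RPC server -->
</property>
<property>
  <name>dfs.namenode.lifeline.reader.count</name>
  <value>1</value>   <!-- reader threads for the lifeline RPC server -->
</property>
```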
[jira] [Updated] (HDFS-16803) Improve some annotations in hdfs module
[ https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16803: Summary: Improve some annotations in hdfs module (was: Improve some annotations in FSDirRenameOp) > Improve some annotations in hdfs module > --- > > Key: HDFS-16803 > URL: https://issues.apache.org/jira/browse/HDFS-16803 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation, namenode >Affects Versions: 2.9.2, 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > > In FSDirRenameOp, some annotations are out of date. E.g: > {code:java} > /** >* @see {@link #unprotectedRenameTo(FSDirectory, String, String, > INodesInPath, >* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} >*/ > static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, > String src, String dst, BlocksMapUpdateInfo collectedBlocks, > boolean logRetryCache,Options.Rename... options) > throws IOException { > {code} > We should try to improve these annotations to make the documentation look > more comfortable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16803) Improve some annotations in FSDirRenameOp
JiangHua Zhu created HDFS-16803: --- Summary: Improve some annotations in FSDirRenameOp Key: HDFS-16803 URL: https://issues.apache.org/jira/browse/HDFS-16803 Project: Hadoop HDFS Issue Type: Improvement Components: documentation, namenode Affects Versions: 3.3.4, 2.9.2 Reporter: JiangHua Zhu In FSDirRenameOp, some annotations are out of date. E.g: {code:java} /** * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath, * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} */ static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, String src, String dst, BlocksMapUpdateInfo collectedBlocks, boolean logRetryCache,Options.Rename... options) throws IOException { {code} We should try to improve these annotations to make the documentation look more comfortable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16803) Improve some annotations in FSDirRenameOp
[ https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16803: --- Assignee: JiangHua Zhu > Improve some annotations in FSDirRenameOp > - > > Key: HDFS-16803 > URL: https://issues.apache.org/jira/browse/HDFS-16803 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation, namenode >Affects Versions: 2.9.2, 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > In FSDirRenameOp, some annotations are out of date. E.g: > {code:java} > /** >* @see {@link #unprotectedRenameTo(FSDirectory, String, String, > INodesInPath, >* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)} >*/ > static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc, > String src, String dst, BlocksMapUpdateInfo collectedBlocks, > boolean logRetryCache,Options.Rename... options) > throws IOException { > {code} > We should try to improve these annotations to make the documentation look > more comfortable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16802) Print options when accessing ClientProtocol#rename2()
[ https://issues.apache.org/jira/browse/HDFS-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616873#comment-17616873 ] JiangHua Zhu commented on HDFS-16802: - New log format: {code:java} 2022-10-13 12:11:38,813 [Listener at localhost/58086] DEBUG hdfs.StateChange (FSDirRenameOp.java:renameToInt(256)) - DIR* NameSystem.renameTo: with options - /testNamenodeRetryCache/testRename2/src to /testNamenodeRetryCache /testRename2/target, options=[NONE] {code} > Print options when accessing ClientProtocol#rename2() > - > > Key: HDFS-16802 > URL: https://issues.apache.org/jira/browse/HDFS-16802 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > When accessing ClientProtocol#rename2(), the carried options cannot be seen > in the log. Here is some log information: > {code:java} > 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG hdfs.StateChange > (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with > options - /testNamenodeRetryCache/testRename2/src to > /testNamenodeRetryCache/testRename2/target > {code} > We should improve this, maybe printing options would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16802) Print options when accessing ClientProtocol#rename2()
[ https://issues.apache.org/jira/browse/HDFS-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16802: --- Assignee: JiangHua Zhu > Print options when accessing ClientProtocol#rename2() > - > > Key: HDFS-16802 > URL: https://issues.apache.org/jira/browse/HDFS-16802 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.4 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > When accessing ClientProtocol#rename2(), the carried options cannot be seen > in the log. Here is some log information: > {code:java} > 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG hdfs.StateChange > (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with > options - /testNamenodeRetryCache/testRename2/src to > /testNamenodeRetryCache/testRename2/target > {code} > We should improve this, maybe printing options would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16802) Print options when accessing ClientProtocol#rename2()
JiangHua Zhu created HDFS-16802: --- Summary: Print options when accessing ClientProtocol#rename2() Key: HDFS-16802 URL: https://issues.apache.org/jira/browse/HDFS-16802 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.4 Reporter: JiangHua Zhu When accessing ClientProtocol#rename2(), the carried options cannot be seen in the log. Here is some log information: {code:java} 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG hdfs.StateChange (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with options - /testNamenodeRetryCache/testRename2/src to /testNamenodeRetryCache/testRename2/target {code} We should improve this, maybe printing options would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
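The improvement amounts to appending the Options.Rename values to the existing debug message. A self-contained sketch follows; the enum and method are stand-ins for illustration (Rename mimics org.apache.hadoop.fs.Options.Rename), not the actual FSDirRenameOp code.

```java
import java.util.Arrays;

// Sketch of the proposed log line for HDFS-16802: append the Options.Rename
// values to NameSystem.renameTo's debug message. Rename here is a stand-in
// for org.apache.hadoop.fs.Options.Rename.
public class RenameLogSketch {
  enum Rename { NONE, OVERWRITE, TO_TRASH }

  static String formatRenameLog(String src, String dst, Rename... options) {
    return "DIR* NameSystem.renameTo: with options - " + src + " to " + dst
        + ", options=" + Arrays.toString(options);
  }

  public static void main(String[] args) {
    System.out.println(formatRenameLog(
        "/testNamenodeRetryCache/testRename2/src",
        "/testNamenodeRetryCache/testRename2/target",
        Rename.NONE));
  }
}
```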
[jira] [Updated] (HDFS-16733) Improve INode#isRoot()
[ https://issues.apache.org/jira/browse/HDFS-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16733: Description: When constructing an INodeFile or INodeDirectory, it is usually necessary to give a name. For getLocalNameBytes, there are not many restrictions, such as null can be set. But an exception is thrown: {code:java} INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, null, perm, 0L); {code} Some exceptions: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78) {code} Although these situations rarely occur in production environments, we should refine the implementation of isRoot() to avoid this exception. This can enhance system robustness. was: When constructing an INodeFile or INodeDirectory, it is usually necessary to give a name. For getLocalNameBytes, there are not many restrictions, such as null can be set. But an exception is thrown: {code:java} INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, null, perm, 0L); {code} Some exceptions: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78) at org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274) {code} Although these situations rarely occur in production environments, we should refine the implementation of isRoot() to avoid this exception. This can enhance system robustness. > Improve INode#isRoot() > -- > > Key: HDFS-16733 > URL: https://issues.apache.org/jira/browse/HDFS-16733 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > > When constructing an INodeFile or INodeDirectory, it is usually necessary to > give a name. For getLocalNameBytes, there are not many restrictions, such as > null can be set. But an exception is thrown: > {code:java} > INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, > null, perm, 0L); > {code} > Some exceptions: > {code:java} > java.lang.NullPointerException > at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78) > {code} > Although these situations rarely occur in production environments, we should > refine the implementation of isRoot() to avoid this exception. This can > enhance system robustness. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16733) Improve INode#isRoot()
[ https://issues.apache.org/jira/browse/HDFS-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16733: --- Assignee: JiangHua Zhu > Improve INode#isRoot() > -- > > Key: HDFS-16733 > URL: https://issues.apache.org/jira/browse/HDFS-16733 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > When constructing an INodeFile or INodeDirectory, it is usually necessary to > give a name. For getLocalNameBytes, there are not many restrictions, such as > null can be set. But an exception is thrown: > {code:java} > INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, > null, perm, 0L); > {code} > Some exceptions: > {code:java} > java.lang.NullPointerException > at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78) > at > org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274) > {code} > Although these situations rarely occur in production environments, we should > refine the implementation of isRoot() to avoid this exception. This can > enhance system robustness. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16733) Improve INode#isRoot()
JiangHua Zhu created HDFS-16733: --- Summary: Improve INode#isRoot() Key: HDFS-16733 URL: https://issues.apache.org/jira/browse/HDFS-16733 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 3.3.0 Reporter: JiangHua Zhu When constructing an INodeFile or INodeDirectory, it is usually necessary to give a name. For getLocalNameBytes, there are not many restrictions, such as null can be set. But an exception is thrown: {code:java} INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, null, perm, 0L); {code} Some exceptions: {code:java} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78) at org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274) {code} Although these situations rarely occur in production environments, we should refine the implementation of isRoot() to avoid this exception. This can enhance system robustness. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
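The NullPointerException above comes from isRoot() dereferencing the local name without a null check. A null-safe version could look like the sketch below; this is an illustrative stand-alone method, not a patch against INode (the actual fix might instead reject null names at construction time).

```java
// Sketch for HDFS-16733: INode#isRoot() compares getLocalNameBytes().length
// with 0 and so throws NullPointerException when the name is null. A
// null-safe variant treats a null name as "not root":
public class IsRootSketch {
  static boolean isRoot(byte[] localNameBytes) {
    // Only the root inode has an empty (zero-length) name.
    return localNameBytes != null && localNameBytes.length == 0;
  }

  public static void main(String[] args) {
    System.out.println(isRoot(new byte[0])); // root: empty name
    System.out.println(isRoot(null));        // no longer throws NPE
  }
}
```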
[jira] [Updated] (HDFS-16729) RBF: fix some unreasonably annotated docs
[ https://issues.apache.org/jira/browse/HDFS-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16729: Component/s: documentation > RBF: fix some unreasonably annotated docs > - > > Key: HDFS-16729 > URL: https://issues.apache.org/jira/browse/HDFS-16729 > Project: Hadoop HDFS > Issue Type: Improvement > Components: documentation, rbf >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Attachments: image-2022-08-16-14-19-07-630.png > > > I found some unreasonably annotated documentation here. E.g: > !image-2022-08-16-14-19-07-630.png! > It should be our job to make these annotations cleaner. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16729) RBF: fix some unreasonably annotated docs
[ https://issues.apache.org/jira/browse/HDFS-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16729: --- Assignee: JiangHua Zhu > RBF: fix some unreasonably annotated docs > - > > Key: HDFS-16729 > URL: https://issues.apache.org/jira/browse/HDFS-16729 > Project: Hadoop HDFS > Issue Type: Improvement > Components: rbf >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: image-2022-08-16-14-19-07-630.png > > > I found some unreasonably annotated documentation here. E.g: > !image-2022-08-16-14-19-07-630.png! > It should be our job to make these annotations cleaner.
[jira] [Created] (HDFS-16729) RBF: fix some unreasonably annotated docs
JiangHua Zhu created HDFS-16729: --- Summary: RBF: fix some unreasonably annotated docs Key: HDFS-16729 URL: https://issues.apache.org/jira/browse/HDFS-16729 Project: Hadoop HDFS Issue Type: Improvement Components: rbf Affects Versions: 3.3.3 Reporter: JiangHua Zhu Attachments: image-2022-08-16-14-19-07-630.png I found some unreasonably annotated documentation here. E.g: !image-2022-08-16-14-19-07-630.png! It should be our job to make these annotations cleaner.
[jira] [Work started] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log
[ https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16700 started by JiangHua Zhu. --- > Record the real client ip carried by the Router in the NameNode log > --- > > Key: HDFS-16700 > URL: https://issues.apache.org/jira/browse/HDFS-16700 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, rbf >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Here are some logs recorded by the NameNode when using RBF: > {code:java} > 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001 > {code} > The ip information here is still the router. If the real client ip is > recorded, it will more clearly express where the request comes from. > E.g: > {code:java} > 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001, client=172.111.65.123:43232 > {code}
[jira] [Updated] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log
[ https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16700: Affects Version/s: (was: 3.3.0) > Record the real client ip carried by the Router in the NameNode log > --- > > Key: HDFS-16700 > URL: https://issues.apache.org/jira/browse/HDFS-16700 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, rbf >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > Here are some logs recorded by the NameNode when using RBF: > {code:java} > 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001 > {code} > The ip information here is still the router. If the real client ip is > recorded, it will more clearly express where the request comes from. > E.g: > {code:java} > 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001, client=172.111.65.123:43232 > {code}
[jira] [Updated] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log
[ https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16700: Affects Version/s: 3.3.0 > Record the real client ip carried by the Router in the NameNode log > --- > > Key: HDFS-16700 > URL: https://issues.apache.org/jira/browse/HDFS-16700 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, rbf >Affects Versions: 3.3.0, 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > Here are some logs recorded by the NameNode when using RBF: > {code:java} > 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001 > {code} > The ip information here is still the router. If the real client ip is > recorded, it will more clearly express where the request comes from. > E.g: > {code:java} > 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001, client=172.111.65.123:43232 > {code}
[jira] [Created] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log
JiangHua Zhu created HDFS-16700: --- Summary: Record the real client ip carried by the Router in the NameNode log Key: HDFS-16700 URL: https://issues.apache.org/jira/browse/HDFS-16700 Project: Hadoop HDFS Issue Type: Improvement Components: namenode, rbf Affects Versions: 3.3.3 Reporter: JiangHua Zhu Here are some logs recorded by the NameNode when using RBF: {code:java} 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 8020, call Call#127 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.10.100.67:58001 {code} The ip information here is still the router. If the real client ip is recorded, it will more clearly express where the request comes from. E.g: {code:java} 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 8020, call Call#127 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 172.10.100.67:58001, client=172.111.65.123:43232 {code}
[jira] [Assigned] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log
[ https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16700: --- Assignee: JiangHua Zhu > Record the real client ip carried by the Router in the NameNode log > --- > > Key: HDFS-16700 > URL: https://issues.apache.org/jira/browse/HDFS-16700 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode, rbf >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > Here are some logs recorded by the NameNode when using RBF: > {code:java} > 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001 > {code} > The ip information here is still the router. If the real client ip is > recorded, it will more clearly express where the request comes from. > E.g: > {code:java} > 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port > 8020, call Call#127 Retry#0 > org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from > 172.10.100.67:58001, client=172.111.65.123:43232 > {code}
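The log change proposed in HDFS-16700 can be sketched as a small formatting helper. The method name and signature below are illustrative only, not the actual NameNode or Router code; in a real implementation the forwarded client address would arrive through some RPC-level mechanism (the sketch just assumes it is available as a string):

```java
// Hypothetical helper: append the forwarded client address to a log line
// only when the Router actually supplied one.
public class ClientIpLog {
    static String withClientIp(String baseLine, String realClientAddr) {
        return (realClientAddr == null || realClientAddr.isEmpty())
                ? baseLine
                : baseLine + ", client=" + realClientAddr;
    }

    public static void main(String[] args) {
        String base = "ClientProtocol.getFileInfo from 172.10.100.67:58001";
        // With a forwarded address, matching the desired format above:
        System.out.println(withClientIp(base, "172.111.65.123:43232"));
        // Without one, the line is unchanged:
        System.out.println(withClientIp(base, null));
    }
}
```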
[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568284#comment-17568284 ] JiangHua Zhu commented on HDFS-16565: - Thanks to [~weichiu] for the suggestion. I think I will use it. > DataNode holds a large number of CLOSE_WAIT connections that are not released > - > > Key: HDFS-16565 > URL: https://issues.apache.org/jira/browse/HDFS-16565 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, ec >Affects Versions: 3.3.0 > Environment: CentOS Linux release 7.5.1804 (Core) >Reporter: JiangHua Zhu >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > There is a strange phenomenon here, DataNode holds a large number of > connections in CLOSE_WAIT state and does not release. > netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' > LISTEN 20 > CLOSE_WAIT 17707 > ESTABLISHED 1450 > TIME_WAIT 12 > It can be found that the connections with the CLOSE_WAIT state have reached > 17k and are still growing. View these CLOSE_WAITs through the lsof command, > and get the following phenomenon: > lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND' > !screenshot-1.png! > It can be seen that the reason for this phenomenon is that Socket#close() is > not called correctly, and DataNode interacts with other nodes as Client.
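Since the diagnosis above points at Socket#close() not being called on every path, the usual fix is the try-with-resources pattern. This is a generic sketch (the demo connects to a loopback ServerSocket), not the actual DataNode code:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketCloseSketch {
    // try-with-resources guarantees Socket#close() runs even when the peer
    // has shut down or an exception is thrown, so the kernel can finish the
    // TCP teardown instead of leaving the descriptor parked in CLOSE_WAIT.
    static boolean ping(String host, int port) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(1);
            return s.isConnected();
        } // closed here on every path
    }

    public static void main(String[] args) throws IOException {
        // Loopback demo: the connect succeeds via the listen backlog.
        try (ServerSocket server = new ServerSocket(0)) {
            System.out.println(ping("127.0.0.1", server.getLocalPort()));
        }
    }
}
```

A socket sits in CLOSE_WAIT precisely between the peer's FIN and the local close(), so any client path that drops the Socket without closing it accumulates exactly the pattern shown in the netstat output above.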
[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16650: Fix Version/s: (was: 2.9.2) Affects Version/s: 2.9.2 > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Affects Versions: 2.9.2 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > {code:java} > long now = new Date().getTime(); > {code} > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > {code:java} > long now = Time.now(); > {code}
[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16650: Description: Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); {code:java} long now = new Date().getTime(); {code} This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now(); was: Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now(); > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Fix For: 2.9.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > long now = new Date().getTime(); > {code:java} > long now = new Date().getTime(); > {code} > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > long now = Time.now();
[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16650: Description: Getting timestamps in Centralized cache management is done in the following ways: {code:java} long now = new Date().getTime(); {code} This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: {code:java} long now = Time.now(); {code} was: Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); {code:java} long now = new Date().getTime(); {code} This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now(); > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Fix For: 2.9.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > {code:java} > long now = new Date().getTime(); > {code} > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > {code:java} > long now = Time.now(); > {code}
[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16650: Description: Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now(); was: Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now(); > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Fix For: 2.9.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > long now = new Date().getTime(); > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > long now = Time.now();
[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16650: Priority: Minor (was: Major) > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Fix For: 2.9.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > long now = new Date().getTime(); > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > long now = Time.now();
[jira] [Work started] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16650 started by JiangHua Zhu. --- > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Fix For: 2.9.2 > > Time Spent: 10m > Remaining Estimate: 0h > > Getting timestamps in Centralized cache management is done in the following > ways: > long now = new Date().getTime(); > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > long now = Time.now();
[jira] [Created] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
JiangHua Zhu created HDFS-16650: --- Summary: Optimize the cost of obtaining timestamps in Centralized cache management Key: HDFS-16650 URL: https://issues.apache.org/jira/browse/HDFS-16650 Project: Hadoop HDFS Issue Type: Improvement Components: caching Reporter: JiangHua Zhu Fix For: 2.9.2 Getting timestamps in Centralized cache management is done in the following ways: long now = new Date().getTime(); This approach doesn't seem to be optimal since we only use it once here. It might be better to use the tool Time to get the timestamp. E.g: long now = Time.now();
[jira] [Assigned] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management
[ https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16650: --- Assignee: JiangHua Zhu > Optimize the cost of obtaining timestamps in Centralized cache management > - > > Key: HDFS-16650 > URL: https://issues.apache.org/jira/browse/HDFS-16650 > Project: Hadoop HDFS > Issue Type: Improvement > Components: caching >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Fix For: 2.9.2 > > > Getting timestamps in Centralized cache management is done in the following > ways: > long now = new Date().getTime(); > This approach doesn't seem to be optimal since we only use it once here. > It might be better to use the tool Time to get the timestamp. E.g: > long now = Time.now();
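The cost difference discussed in HDFS-16650 is the throwaway java.util.Date allocation: System.currentTimeMillis() reads the same wall clock directly, and to my understanding Hadoop's org.apache.hadoop.util.Time.now() is a thin wrapper over it. A minimal self-contained sketch:

```java
import java.util.Date;

public class TimestampSketch {
    public static void main(String[] args) {
        long viaDate = new Date().getTime();       // allocates a Date object
        long direct  = System.currentTimeMillis(); // no allocation
        // Both read the same clock; the two values differ only by the time
        // elapsed between the calls, so they agree to within a second.
        System.out.println(Math.abs(direct - viaDate) < 1000);
    }
}
```

The improvement matters little for a single call, but avoiding a per-call allocation on hot paths keeps GC pressure down, which is the spirit of the issue.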
[jira] [Work started] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY
[ https://issues.apache.org/jira/browse/HDFS-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16647 started by JiangHua Zhu. --- > Delete unused NameNode#FS_HDFS_IMPL_KEY > --- > > Key: HDFS-16647 > URL: https://issues.apache.org/jira/browse/HDFS-16647 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > There's some history here, NameNode#FS_HDFS_IMPL_KEY was introduced in > HDFS-15450, and something was removed later in HDFS-15533, but > FS_HDFS_IMPL_KEY was kept. > Here are some discussion details: > https://github.com/apache/hadoop/pull/2229#discussion_r470935801 > It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.
[jira] [Created] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY
JiangHua Zhu created HDFS-16647: --- Summary: Delete unused NameNode#FS_HDFS_IMPL_KEY Key: HDFS-16647 URL: https://issues.apache.org/jira/browse/HDFS-16647 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.3 Reporter: JiangHua Zhu There's some history here, NameNode#FS_HDFS_IMPL_KEY was introduced in HDFS-15450, and something was removed later in HDFS-15533, but FS_HDFS_IMPL_KEY was kept. Here are some discussion details: https://github.com/apache/hadoop/pull/2229#discussion_r470935801 It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.
[jira] [Assigned] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY
[ https://issues.apache.org/jira/browse/HDFS-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16647: --- Assignee: JiangHua Zhu > Delete unused NameNode#FS_HDFS_IMPL_KEY > --- > > Key: HDFS-16647 > URL: https://issues.apache.org/jira/browse/HDFS-16647 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.3 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > There's some history here, NameNode#FS_HDFS_IMPL_KEY was introduced in > HDFS-15450, and something was removed later in HDFS-15533, but > FS_HDFS_IMPL_KEY was kept. > Here are some discussion details: > https://github.com/apache/hadoop/pull/2229#discussion_r470935801 > It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.
[jira] [Commented] (HDFS-15533) Provide DFS API compatible class(ViewDistributedFileSystem), but use ViewFileSystemOverloadScheme inside
[ https://issues.apache.org/jira/browse/HDFS-15533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561289#comment-17561289 ] JiangHua Zhu commented on HDFS-15533: - Nice to talk to you, [~umamaheswararao]. It seems that the redundant NameNode#FS_HDFS_IMPL_KEY should be removed here. If necessary, I will create a new jira to fix it. Hope to continue to communicate with you. > Provide DFS API compatible class(ViewDistributedFileSystem), but use > ViewFileSystemOverloadScheme inside > > > Key: HDFS-15533 > URL: https://issues.apache.org/jira/browse/HDFS-15533 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: dfs, viewfs >Affects Versions: 3.4.0 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Major > Fix For: 3.3.1, 3.4.0 > > > The idea I have been working on since last week is that we wanted to provide > DFS-compatible APIs with mount functionality, so that existing DFS > applications can work without class cast issues. > When we tested with other components like Hive and HBase, I noticed some > classcast issues.
> {code:java} > HBase example: > java.lang.ClassCastException: > org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme cannot be cast to > org.apache.hadoop.hdfs.DistributedFileSystemjava.lang.ClassCastException: > org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme cannot be cast to > org.apache.hadoop.hdfs.DistributedFileSystem at > org.apache.hadoop.hbase.util.FSUtils.getDFSHedgedReadMetrics(FSUtils.java:1748) > at > org.apache.hadoop.hbase.regionserver.MetricsRegionServerWrapperImpl.(MetricsRegionServerWrapperImpl.java:146) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1594) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1001) > at java.lang.Thread.run(Thread.java:748){code} > {code:java} > Hive: > |io.AcidUtils|: Failed to get files with ID; using regular API: Only > supported for DFS; got class > org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme{code} > So, here the implementation details are as follows: > We extended DistributedFileSystem and created a class called " > ViewDistributedFileSystem" > This vfs=ViewDistributedFileSystem tries to initialize > ViewFileSystemOverloadScheme. On success, calls will delegate to vfs. If it fails > to initialize due to no mount points, or other errors, it will just fall back > to regular DFS init. If users do not configure any mount, the system will > behave exactly like today's DFS. If there are mount points, vfs functionality > will come under DFS. > I have a patch and will post it in some time.
[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556601#comment-17556601 ] JiangHua Zhu commented on HDFS-16637: - Thank you for your trust, [~vjasani]. I will be very careful in the future. > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(174)) - > 2022-06-19 
15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code}
[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing
[ https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556183#comment-17556183 ] JiangHua Zhu commented on HDFS-16637: - Thanks to [~vjasani] for finding this question. I think it was due to my carelessness. I'm very sorry. > TestHDFSCLI#testAll consistently failing > > > Key: HDFS-16637 > URL: https://issues.apache.org/jira/browse/HDFS-16637 > Project: Hadoop HDFS > Issue Type: Test >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The failure seems to have been caused by output change introduced by > HDFS-16581. > {code:java} > 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(146)) - Detailed results: > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(147)) - > --2022-06-19 15:41:16,184 [Listener at > localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(156)) - > --- > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(157)) - Test ID: [629] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(158)) - Test Description: > [printTopology: verifying that the topology map is what we expect] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(159)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs > hdfs://localhost:51486 -printTopology] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(167)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > 
(CLITestHelper.java:displayResults(174)) - > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(178)) - Comparator: > [RegexpAcrossOutputComparator] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(180)) - Comparision result: > [fail] > 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(182)) - Expected output: > [^Rack: > \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)] > 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO cli.CLITestHelper > (CLITestHelper.java:displayResults(184)) - Actual output: > [Rack: /rack1 > 127.0.0.1:51487 (localhost) In Service > 127.0.0.1:51491 (localhost) In ServiceRack: /rack2 > 127.0.0.1:51500 (localhost) In Service > 127.0.0.1:51496 (localhost) In Service > 127.0.0.1:51504 (localhost) In ServiceRack: /rack3 > 127.0.0.1:51508 (localhost) In ServiceRack: /rack4 > 127.0.0.1:51512 (localhost) In Service > 127.0.0.1:51516 (localhost) In Service] > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
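The mismatch in the log above is easy to reproduce outside the test harness. Below is a minimal Python sketch (illustrative only, not the Hadoop test code) of why a RegexpAcrossOutputComparator-style pattern written before HDFS-16581 stops matching once each topology entry gains an "In Service" state suffix:

```python
import re

# Pre-HDFS-16581 expectation: the next address follows ")" separated
# only by whitespace.
old_pattern = re.compile(
    r"^Rack: /rack1\s*"
    r"127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*"
    r"127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)"
)

# -printTopology output after HDFS-16581 appends the DataNode state.
actual = ("Rack: /rack1\n"
          "  127.0.0.1:51487 (localhost) In Service\n"
          "  127.0.0.1:51491 (localhost) In Service\n")

# "\s*" cannot consume the literal " In Service" text between the two
# entries, so the old expectation no longer finds a match.
print(old_pattern.search(actual))        # no match

# A pattern updated for the new state suffix matches again.
new_pattern = re.compile(
    r"^Rack: /rack1\s*"
    r"127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\sIn Service\s*"
    r"127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\sIn Service"
)
print(bool(new_pattern.search(actual)))  # matches
```

This is why the fix is to update the expected regexps in the CLI test data rather than the command output.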
[jira] [Comment Edited] (HDFS-11448) JN log segment syncing should support HA upgrade
[ https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545848#comment-17545848 ] JiangHua Zhu edited comment on HDFS-11448 at 6/10/22 7:15 AM: -- Hi [~hanishakoneru], nice to communicate with you. In JNStorage, getCurrentDir() is not used anywhere. If you don't mind, I'll remove JNStorage#getCurrentDir() which is not used. was (Author: jianghuazhu): Hi [~hanishakoneru], nice to communicate with you. I found the new addition of JNStorage#getCurrentDir() here, and yes, that's good because sd.getCurrentDir() is used in multiple places in the context, but there is no use of it anywhere. If you don't mind, I'll modify this to replace sd.getCurrentDir() with JNStorage#getCurrentDir(). > JN log segment syncing should support HA upgrade > > > Key: HDFS-11448 > URL: https://issues.apache.org/jira/browse/HDFS-11448 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Major > Fix For: 3.0.0-alpha4 > > Attachments: HDFS-11448.001.patch, HDFS-11448.002.patch, > HDFS-11448.003.patch > > > HDFS-4025 adds support for sychronizing past log segments to JNs that missed > them. But, as pointed out by [~jingzhao], if the segment download happens > when an admin tries to rollback, it might fail ([see > comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16621) Remove unused JNStorage#getCurrentDir()
[ https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16621: Description: There is no use of getCurrentDir() anywhere in JNStorage, we should remove it. (was: In JNStorage, sd.getCurrentDir() is used in 5~6 places, It can be replaced with JNStorage#getCurrentDir(), which will be more concise.) > Remove unused JNStorage#getCurrentDir() > --- > > Key: HDFS-16621 > URL: https://issues.apache.org/jira/browse/HDFS-16621 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node, qjm >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > There is no use of getCurrentDir() anywhere in JNStorage, we should remove it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16621) Remove unused JNStorage#getCurrentDir()
[ https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16621: Summary: Remove unused JNStorage#getCurrentDir() (was: Replace sd.getCurrentDir() with JNStorage#getCurrentDir()) > Remove unused JNStorage#getCurrentDir() > --- > > Key: HDFS-16621 > URL: https://issues.apache.org/jira/browse/HDFS-16621 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node, qjm >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 1h > Remaining Estimate: 0h > > In JNStorage, sd.getCurrentDir() is used in 5~6 places, > It can be replaced with JNStorage#getCurrentDir(), which will be more concise. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
[ https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16621 started by JiangHua Zhu. --- > Replace sd.getCurrentDir() with JNStorage#getCurrentDir() > - > > Key: HDFS-16621 > URL: https://issues.apache.org/jira/browse/HDFS-16621 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node, qjm >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In JNStorage, sd.getCurrentDir() is used in 5~6 places, > It can be replaced with JNStorage#getCurrentDir(), which will be more concise. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
[ https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16621: --- Assignee: JiangHua Zhu > Replace sd.getCurrentDir() with JNStorage#getCurrentDir() > - > > Key: HDFS-16621 > URL: https://issues.apache.org/jira/browse/HDFS-16621 > Project: Hadoop HDFS > Issue Type: Improvement > Components: journal-node, qjm >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > > In JNStorage, sd.getCurrentDir() is used in 5~6 places, > It can be replaced with JNStorage#getCurrentDir(), which will be more concise. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
JiangHua Zhu created HDFS-16621: --- Summary: Replace sd.getCurrentDir() with JNStorage#getCurrentDir() Key: HDFS-16621 URL: https://issues.apache.org/jira/browse/HDFS-16621 Project: Hadoop HDFS Issue Type: Improvement Components: journal-node, qjm Affects Versions: 3.3.0 Reporter: JiangHua Zhu In JNStorage, sd.getCurrentDir() is used in 5~6 places, It can be replaced with JNStorage#getCurrentDir(), which will be more concise. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11448) JN log segment syncing should support HA upgrade
[ https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545848#comment-17545848 ] JiangHua Zhu commented on HDFS-11448: - Hi [~hanishakoneru], nice to communicate with you. I found the new addition of JNStorage#getCurrentDir() here, and yes, that's good because sd.getCurrentDir() is used in multiple places in the context, but there is no use of it anywhere. If you don't mind, I'll modify this to replace sd.getCurrentDir() with JNStorage#getCurrentDir(). > JN log segment syncing should support HA upgrade > > > Key: HDFS-11448 > URL: https://issues.apache.org/jira/browse/HDFS-11448 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Major > Fix For: 3.0.0-alpha4 > > Attachments: HDFS-11448.001.patch, HDFS-11448.002.patch, > HDFS-11448.003.patch > > > HDFS-4025 adds support for sychronizing past log segments to JNs that missed > them. But, as pointed out by [~jingzhao], if the segment download happens > when an admin tries to rollback, it might fail ([see > comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]). -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16614) Improve balancer operation strategy and performance
[ https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16614: Description: When the Balancer program is run, it does some work in the following order: 1. Obtain available datanode information from NameNode. 2. Classify and calculate the average utilization according to StorageType. Here, some sets will be obtained in combination with the set thresholds: overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized. 3. According to some calculations, the source and target related to the transfer data are obtained. The source is used for the source end, and the target is used for the data receiving end. 4. Start the data transfer work in parallel. In this process, run iteratively. In this process, the threshold is unified and applied to all StorageTypes, which seems to be a bit rough, because one of the StorageTypes cannot be distinguished, which is based on the currently supported heterogeneous storage. There is an online cluster with more than 2000 nodes, and there is an imbalance in node storage. E.g: !image-2022-06-02-13-18-33-213.png! Here, the average utilization of the cluster is 78%, but the utilization of most nodes is between 85% and 90%. When the balancer is turned on, we find that 85% of the nodes are working as sources. In this case, we think it is not reasonable, because it will occupy more network resources in the cluster, and it will be beneficial to the normal work of the cluster to do some effective restrictions. So here are some changes to make: 1. When the balancer is running, we should actively prompt the suggested value of the threshold related to StorageType. For example: [[DISK, 10%], [SSD, 8%]...] 2. Support to set threshold according to StorageType and work. 3. Add an option to prohibit nodes below the threshold from joining the Source set. 
This is to allow nodes with high utilization to transfer data as soon as possible, which is good for balance. 4. Add new support. If there are a lot of datanode usage in the cluster, it should remain unchanged. For example, the utilization rate of 40% of the nodes in the cluster is 75% to 80%, and these nodes should not join the Source set. Of course this support needs to be specified by the user at runtime. was: When the Balancer program is run, it does some work in the following order: 1. Obtain available datanode information from NameNode. 2. Classify and calculate the average utilization according to StorageType. Here, some sets will be obtained in combination with the set thresholds: overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized. 3. According to some calculations, the source and target related to the transfer data are obtained. The source is used for the source end, and the target is used for the data receiving end. 4. Start the data transfer work in parallel. In this process, run iteratively. In this process, the threshold is unified and applied to all StorageTypes, which seems to be a bit rough, because one of the StorageTypes cannot be distinguished, which is based on the currently supported heterogeneous storage. There is an online cluster with more than 2000 nodes, and there is an imbalance in node storage. E.g: !image-2022-06-02-13-18-33-213.png! Here, the average utilization of the cluster is 78%, but the utilization of most nodes is between 85% and 90%. When the balancer is turned on, we find that 85% of the nodes are working as sources. In this case, we think it is not reasonable, because it will occupy more network resources in the cluster, and it will be beneficial to the normal work of the cluster to do some effective restrictions. So here are some changes to make: 1. When the balancer is running, it should try to prompt the threshold related to StorageType. For example [[DISK, 10%], [SSD, 8%]...] 2. 
Support to set threshold according to StorageType and work. 3. Add an option to prohibit nodes below the threshold from joining the Source set. This is to allow nodes with high utilization to transfer data as soon as possible, which is good for balance. 4. Add new support. If there are a lot of datanode usage in the cluster, it should remain unchanged. For example, the utilization rate of 40% of the nodes in the cluster is 75% to 80%, and these nodes should not join the Source set. Of course this support needs to be specified by the user at runtime. > Improve balancer operation strategy and performance > --- > > Key: HDFS-16614 > URL: https://issues.apache.org/jira/browse/HDFS-16614 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namenode >Affects Versions: 2.9.2 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major >
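The per-StorageType thresholds proposed in the description can be sketched as follows. This is an illustrative Python model of the Balancer's four utilization buckets (overUtilized, aboveAvgUtilized, belowAvgUtilized, underUtilized) with a separate threshold per StorageType; the function, node tuples, and threshold values are hypothetical, not Hadoop APIs, and boundary handling in the real Balancer may differ slightly:

```python
# Hypothetical per-StorageType thresholds, in the [[DISK, 10%], [SSD, 8%]]
# spirit of the proposal above.
THRESHOLDS = {"DISK": 10.0, "SSD": 8.0}

def classify(nodes):
    """Bucket (name, storage_type, used_pct) tuples by how far each node's
    utilization sits from the average for its StorageType."""
    buckets = {"overUtilized": [], "aboveAvgUtilized": [],
               "belowAvgUtilized": [], "underUtilized": []}
    by_type = {}
    for name, stype, used_pct in nodes:
        by_type.setdefault(stype, []).append((name, used_pct))
    for stype, members in by_type.items():
        avg = sum(p for _, p in members) / len(members)
        t = THRESHOLDS[stype]
        for name, p in members:
            if p > avg + t:            # beyond threshold: must shed data
                buckets["overUtilized"].append(name)
            elif p > avg:              # above average but within threshold
                buckets["aboveAvgUtilized"].append(name)
            elif p > avg - t:          # below average but within threshold
                buckets["belowAvgUtilized"].append(name)
            else:                      # beyond threshold: should receive data
                buckets["underUtilized"].append(name)
    return buckets

nodes = [("dn1", "DISK", 90.0), ("dn2", "DISK", 80.0),
         ("dn3", "DISK", 55.0), ("dn4", "SSD", 70.0)]
print(classify(nodes))
```

With separate thresholds per type, a cluster whose DISK tier is tight but whose SSD tier is loose can be balanced with different aggressiveness for each, which is the distinction the single global threshold cannot express.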
[jira] [Updated] (HDFS-16614) Improve balancer operation strategy and performance
[ https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16614: Affects Version/s: 2.9.2 (was: 3.3.0) > Improve balancer operation strategy and performance > --- > > Key: HDFS-16614 > URL: https://issues.apache.org/jira/browse/HDFS-16614 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namenode >Affects Versions: 2.9.2 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: image-2022-06-02-13-18-33-213.png > > > When the Balancer program is run, it does some work in the following order: > 1. Obtain available datanode information from NameNode. > 2. Classify and calculate the average utilization according to StorageType. > Here, some sets will be obtained in combination with the set thresholds: > overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized. > 3. According to some calculations, the source and target related to the > transfer data are obtained. The source is used for the source end, and the > target is used for the data receiving end. > 4. Start the data transfer work in parallel. > In this process, run iteratively. In this process, the threshold is unified > and applied to all StorageTypes, which seems to be a bit rough, because one > of the StorageTypes cannot be distinguished, which is based on the currently > supported heterogeneous storage. > There is an online cluster with more than 2000 nodes, and there is an > imbalance in node storage. E.g: > !image-2022-06-02-13-18-33-213.png! > Here, the average utilization of the cluster is 78%, but the utilization of > most nodes is between 85% and 90%. When the balancer is turned on, we find > that 85% of the nodes are working as sources. In this case, we think it is > not reasonable, because it will occupy more network resources in the cluster, > and it will be beneficial to the normal work of the cluster to do some > effective restrictions. 
> So here are some changes to make: > 1. When the balancer is running, it should try to prompt the threshold > related to StorageType. For example [[DISK, 10%], [SSD, 8%]...] > 2. Support to set threshold according to StorageType and work. > 3. Add an option to prohibit nodes below the threshold from joining the > Source set. This is to allow nodes with high utilization to transfer data as > soon as possible, which is good for balance. > 4. Add new support. If there are a lot of datanode usage in the cluster, it > should remain unchanged. For example, the utilization rate of 40% of the > nodes in the cluster is 75% to 80%, and these nodes should not join the > Source set. Of course this support needs to be specified by the user at > runtime. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16614) Improve balancer operation strategy and performance
JiangHua Zhu created HDFS-16614: --- Summary: Improve balancer operation strategy and performance Key: HDFS-16614 URL: https://issues.apache.org/jira/browse/HDFS-16614 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover, namenode Affects Versions: 3.3.0 Reporter: JiangHua Zhu Attachments: image-2022-06-02-13-18-33-213.png When the Balancer program is run, it does some work in the following order: 1. Obtain available datanode information from NameNode. 2. Classify and calculate the average utilization according to StorageType. Here, some sets will be obtained in combination with the set thresholds: overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized. 3. According to some calculations, the source and target related to the transfer data are obtained. The source is used for the source end, and the target is used for the data receiving end. 4. Start the data transfer work in parallel. In this process, run iteratively. In this process, the threshold is unified and applied to all StorageTypes, which seems to be a bit rough, because one of the StorageTypes cannot be distinguished, which is based on the currently supported heterogeneous storage. There is an online cluster with more than 2000 nodes, and there is an imbalance in node storage. E.g: !image-2022-06-02-13-18-33-213.png! Here, the average utilization of the cluster is 78%, but the utilization of most nodes is between 85% and 90%. When the balancer is turned on, we find that 85% of the nodes are working as sources. In this case, we think it is not reasonable, because it will occupy more network resources in the cluster, and it will be beneficial to the normal work of the cluster to do some effective restrictions. So here are some changes to make: 1. When the balancer is running, it should try to prompt the threshold related to StorageType. For example [[DISK, 10%], [SSD, 8%]...] 2. Support to set threshold according to StorageType and work. 3. 
Add an option to prohibit nodes below the threshold from joining the Source set. This is to allow nodes with high utilization to transfer data as soon as possible, which is good for balance. 4. Add new support. If there are a lot of datanode usage in the cluster, it should remain unchanged. For example, the utilization rate of 40% of the nodes in the cluster is 75% to 80%, and these nodes should not join the Source set. Of course this support needs to be specified by the user at runtime. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16614) Improve balancer operation strategy and performance
[ https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16614: --- Assignee: JiangHua Zhu > Improve balancer operation strategy and performance > --- > > Key: HDFS-16614 > URL: https://issues.apache.org/jira/browse/HDFS-16614 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namenode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: image-2022-06-02-13-18-33-213.png > > > When the Balancer program is run, it does some work in the following order: > 1. Obtain available datanode information from NameNode. > 2. Classify and calculate the average utilization according to StorageType. > Here, some sets will be obtained in combination with the set thresholds: > overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized. > 3. According to some calculations, the source and target related to the > transfer data are obtained. The source is used for the source end, and the > target is used for the data receiving end. > 4. Start the data transfer work in parallel. > In this process, run iteratively. In this process, the threshold is unified > and applied to all StorageTypes, which seems to be a bit rough, because one > of the StorageTypes cannot be distinguished, which is based on the currently > supported heterogeneous storage. > There is an online cluster with more than 2000 nodes, and there is an > imbalance in node storage. E.g: > !image-2022-06-02-13-18-33-213.png! > Here, the average utilization of the cluster is 78%, but the utilization of > most nodes is between 85% and 90%. When the balancer is turned on, we find > that 85% of the nodes are working as sources. In this case, we think it is > not reasonable, because it will occupy more network resources in the cluster, > and it will be beneficial to the normal work of the cluster to do some > effective restrictions. 
> So here are some changes to make: > 1. When the balancer is running, it should try to prompt the threshold > related to StorageType. For example [[DISK, 10%], [SSD, 8%]...] > 2. Support to set threshold according to StorageType and work. > 3. Add an option to prohibit nodes below the threshold from joining the > Source set. This is to allow nodes with high utilization to transfer data as > soon as possible, which is good for balance. > 4. Add new support. If there are a lot of datanode usage in the cluster, it > should remain unchanged. For example, the utilization rate of 40% of the > nodes in the cluster is 75% to 80%, and these nodes should not join the > Source set. Of course this support needs to be specified by the user at > runtime. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works
[ https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542370#comment-17542370 ] JiangHua Zhu commented on HDFS-16594: - Thanks [~sodonnell] and [~weichiu] for your comments and following. > Many RpcCalls are blocked for a while while Decommission works > -- > > Key: HDFS-16594 > URL: https://issues.apache.org/jira/browse/HDFS-16594 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.9.2 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: image-2022-05-26-02-05-38-878.png > > > When there are some DataNodes that need to go offline, Decommission starts to > work, and periodically checks the number of blocks remaining to be processed. > By default, when checking more than > 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the > DatanodeAdminDefaultMonitor thread will sleep for a while before continuing. > If the number of blocks to be checked is very large, for example, the number > of replicas managed by the DataNode reaches 90w or even 100w, during this > period, the DatanodeAdminDefaultMonitor will continue to hold the > FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs: > !image-2022-05-26-02-05-38-878.png! > It can be seen that in the last inspection process, there were more than 100w > blocks. 
> When the check is over, FSNamesystemLock is released and RpcCall starts > working: > ' > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process > from client Call#5571549 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:35727 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process > from client Call#36795561 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:37793 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process > from client Call#5497586 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:23475 > ' > ' > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process > from client Call#6043903 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:34746 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process > from client Call#274471 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:46419 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process > from client Call#73375524 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > ...:34241 > ' > Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will > be longer than usual. A very large number of RpcCalls were affected during > this time. 
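The blocking pattern in those logs can be modeled in a few lines. In the sketch below (plain Python threading, not Hadoop code), a "monitor" thread holds a shared lock for the length of a long scan while a "heartbeat" handler waits on the same lock, mirroring how sendHeartbeat calls queue behind FSNamesystemLock while DatanodeAdminDefaultMonitor runs; the 0.2 s hold is a stand-in for checking ~1M blocks:

```python
import threading
import time

ns_lock = threading.Lock()          # stands in for FSNamesystemLock
heartbeat_wait = []

def monitor_scan(hold_seconds):
    with ns_lock:
        time.sleep(hold_seconds)    # the scan never releases the lock

def heartbeat():
    start = time.monotonic()
    with ns_lock:                   # sendHeartbeat needs the same lock
        heartbeat_wait.append(time.monotonic() - start)

scan = threading.Thread(target=monitor_scan, args=(0.2,))
scan.start()
time.sleep(0.05)                    # let the monitor own the lock first
hb = threading.Thread(target=heartbeat)
hb.start()
scan.join()
hb.join()
print(f"heartbeat queued for {heartbeat_wait[0]:.2f}s behind the scan")
```

The measured wait is roughly the remaining lock-hold time, which is why RpcQueueTime + RpcProcessingTime in the logs tracks the length of the monitor's scan.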
[jira] [Resolved] (HDFS-16592) Fix typo for BalancingPolicy
[ https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu resolved HDFS-16592. - Resolution: Not A Problem > Fix typo for BalancingPolicy > > > Key: HDFS-16592 > URL: https://issues.apache.org/jira/browse/HDFS-16592 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, documentation, namenode >Affects Versions: 3.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Attachments: image-2022-05-24-11-29-14-019.png > Time Spent: 1h > Remaining Estimate: 0h > > !image-2022-05-24-11-29-14-019.png! > 'NOT' should be lowercase rather than uppercase.
[jira] [Updated] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works
[ https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16594: Description: When there are some DataNodes that need to go offline, Decommission starts to work, and periodically checks the number of blocks remaining to be processed. By default, when checking more than 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the DatanodeAdminDefaultMonitor thread will sleep for a while before continuing. If the number of blocks to be checked is very large, for example, the number of replicas managed by the DataNode reaches 90w or even 100w, during this period, the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs: !image-2022-05-26-02-05-38-878.png! It can be seen that in the last inspection process, there were more than 100w blocks. When the check is over, FSNamesystemLock is released and RpcCall starts working: ' 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process from client Call#5571549 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:35727 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process from client Call#36795561 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:37793 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process from client Call#5497586 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:23475 ' ' 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:34746 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process from client Call#274471 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:46419 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process from client Call#73375524 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from ...:34241 ' Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will be longer than usual. A very large number of RpcCalls were affected during this time. was: When there are some DataNodes that need to go offline, Decommission starts to work, and periodically checks the number of blocks remaining to be processed. By default, when checking more than 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the DatanodeAdminDefaultMonitor thread will sleep for a while before continuing. If the number of blocks to be checked is very large, for example, the number of replicas managed by the DataNode reaches 90w or even 100w, during this period, the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs: !image-2022-05-26-02-05-38-878.png! It can be seen that in the last inspection process, there were more than 100w blocks. 
When the check is over, FSNamesystemLock is released and RpcCall starts working: ' 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process from client Call#5571549 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.196.145.92:35727 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process from client Call#36795561 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.196.99.152:37793 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process from client Call#5497586 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.196.146.56:23475 ' ' 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process from client Call#6043903 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 10.196.82.106:34746 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler
[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works
[ https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542190#comment-17542190 ] JiangHua Zhu commented on HDFS-16594: - In my opinion, processing RpcCalls on time is a relatively high priority, and DatanodeAdminDefaultMonitor should not hold the FSNamesystemLock for too long. Here are 2 ways to optimize: 1. The default value of ${dfs.namenode.decommission.blocks.per.interval} can be lowered, such as 1 or 2. 2. Add time-slice processing to DatanodeAdminDefaultMonitor: for example, when it has worked for more than 500ms, force it to sleep for 10ms and then resume. We can choose one of these 2 methods. [~weichiu] [~ayushtkn], do you have any good suggestions? > Many RpcCalls are blocked for a while while Decommission works > -- > > Key: HDFS-16594 > URL: https://issues.apache.org/jira/browse/HDFS-16594 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 2.9.2 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: image-2022-05-26-02-05-38-878.png > > > When some DataNodes need to go offline, decommissioning starts to work and > periodically checks the number of blocks remaining to be processed. > By default, after checking more than > 500,000 (${dfs.namenode.decommission.blocks.per.interval}) blocks, the > DatanodeAdminDefaultMonitor thread sleeps for a while before continuing. > If the number of blocks to be checked is very large, for example when the > number of replicas managed by the DataNode reaches 900,000 or even 1,000,000, > the DatanodeAdminDefaultMonitor holds the FSNamesystemLock for the whole > period, which blocks a lot of RpcCalls. Here are some logs: > !image-2022-05-26-02-05-38-878.png! > It can be seen that in the last inspection there were more than 1,000,000 > blocks. 
> When the check is over, FSNamesystemLock is released and RpcCall starts > working: > ' > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process > from client Call#5571549 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.145.92:35727 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process > from client Call#36795561 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.99.152:37793 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process > from client Call#5497586 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.146.56:23475 > ' > ' > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process > from client Call#6043903 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.82.106:34746 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process > from client Call#274471 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.149.175:46419 > 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on > 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process > from client Call#73375524 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from > 10.196.81.46:34241 > ' > Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will > be longer than usual. A very large number of RpcCalls were affected during > this time. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
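The second proposal in the comment above (hold the lock for at most ~500ms per slice, then yield ~10ms so queued RPC calls can run) can be sketched as follows. This is a minimal illustration, not the actual DatanodeAdminDefaultMonitor code; the 500ms/10ms values come from the comment, and `scan_blocks`/`check_one_block` are hypothetical names.

```python
import time

# Hypothetical sketch of the "time slice" proposal: scan blocks while holding
# a lock, but release it and pause briefly whenever the slice budget is spent,
# so callers blocked on the lock (e.g. sendHeartbeat) can be served.
MAX_LOCK_HELD_MS = 500   # work at most ~500 ms per slice (from the comment)
SLEEP_MS = 10            # then yield the lock for ~10 ms (from the comment)

def scan_blocks(blocks, lock, check_one_block):
    slice_start = time.monotonic()
    lock.acquire()
    try:
        for block in blocks:
            check_one_block(block)
            elapsed_ms = (time.monotonic() - slice_start) * 1000
            if elapsed_ms > MAX_LOCK_HELD_MS:
                # Release the lock so blocked RPC handlers can run,
                # then resume scanning after a short pause.
                lock.release()
                time.sleep(SLEEP_MS / 1000)
                lock.acquire()
                slice_start = time.monotonic()
    finally:
        lock.release()
```

Releasing and re-acquiring the lock at slice boundaries is what bounds the worst-case wait of a queued RpcCall, regardless of how many blocks remain to be checked.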
[jira] [Assigned] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works
[ https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16594: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works
JiangHua Zhu created HDFS-16594: --- Summary: Many RpcCalls are blocked for a while while Decommission works Key: HDFS-16594 URL: https://issues.apache.org/jira/browse/HDFS-16594 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 2.9.2 Reporter: JiangHua Zhu Attachments: image-2022-05-26-02-05-38-878.png
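The "Slow RPC" WARN lines quoted in this issue share a regular shape, so their impact can be summarized mechanically. A small sketch follows; the regex targets only the log format shown here and is not a stable Hadoop log contract.

```python
import re

# Matches e.g. "Slow RPC : sendHeartbeat took 3488 milliseconds to process"
# as seen in the WARN lines quoted in HDFS-16594.
SLOW_RPC_RE = re.compile(
    r"Slow RPC : (?P<method>\w+) took (?P<ms>\d+) milliseconds")

def summarize_slow_rpcs(lines):
    """Return {method: (count, max_ms)} for Slow RPC warnings."""
    stats = {}
    for line in lines:
        m = SLOW_RPC_RE.search(line)
        if not m:
            continue
        method, ms = m.group("method"), int(m.group("ms"))
        count, max_ms = stats.get(method, (0, 0))
        stats[method] = (count + 1, max(max_ms, ms))
    return stats
```

Feeding the NameNode log through this over the decommission window makes it easy to see how many sendHeartbeat calls were delayed and by how much.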
[jira] [Updated] (HDFS-16592) Fix typo for BalancingPolicy
[ https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16592: Component/s: documentation > Fix typo for BalancingPolicy > > > Key: HDFS-16592 > URL: https://issues.apache.org/jira/browse/HDFS-16592 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, documentation, namenode >Affects Versions: 3.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Minor > Labels: pull-request-available > Attachments: image-2022-05-24-11-29-14-019.png > > Time Spent: 10m > Remaining Estimate: 0h > > !image-2022-05-24-11-29-14-019.png! > 'NOT' should be lowercase rather than uppercase.
[jira] [Work started] (HDFS-16592) Fix typo for BalancingPolicy
[ https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16592 started by JiangHua Zhu.
[jira] [Assigned] (HDFS-16592) Fix typo for BalancingPolicy
[ https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16592: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16592) Fix typo for BalancingPolicy
JiangHua Zhu created HDFS-16592: --- Summary: Fix typo for BalancingPolicy Key: HDFS-16592 URL: https://issues.apache.org/jira/browse/HDFS-16592 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover, namenode Affects Versions: 3.4.0 Reporter: JiangHua Zhu Attachments: image-2022-05-24-11-29-14-019.png
[jira] [Updated] (HDFS-16581) Print node status when executing printTopology
[ https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16581: Summary: Print node status when executing printTopology (was: Print DataNode node status) > Print node status when executing printTopology > -- > > Key: HDFS-16581 > URL: https://issues.apache.org/jira/browse/HDFS-16581 > Project: Hadoop HDFS > Issue Type: Improvement > Components: dfsadmin, namenode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > We can use the dfsadmin tool to see which DataNodes the cluster has, and some > of these nodes are alive, DECOMMISSIONED, or DECOMMISSION_INPROGRESS. It > would be helpful if we could get this information in a timely manner, such as > troubleshooting cluster failures, tracking node status, etc. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches
[ https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16588 started by JiangHua Zhu. --- > Backport HDFS-16584 to branch-3.3 and other active old branches > --- > > Key: HDFS-16588 > URL: https://issues.apache.org/jira/browse/HDFS-16588 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namenode >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This issue has been dealt with in trunk and again needs to be backported to > branch-3.3 or another active branch. > See HDFS-16584. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches
[ https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16588: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches
JiangHua Zhu created HDFS-16588: --- Summary: Backport HDFS-16584 to branch-3.3 and other active old branches Key: HDFS-16588 URL: https://issues.apache.org/jira/browse/HDFS-16588 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover, namenode Reporter: JiangHua Zhu
[jira] [Work started] (HDFS-16584) Record StandbyNameNode information when Balancer is running
[ https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16584 started by JiangHua Zhu. --- > Record StandbyNameNode information when Balancer is running > --- > > Key: HDFS-16584 > URL: https://issues.apache.org/jira/browse/HDFS-16584 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer mover, namenode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Attachments: image-2022-05-19-20-23-23-825.png > > Time Spent: 10m > Remaining Estimate: 0h > > When the Balancer is running, we allow block data to be fetched from the > StandbyNameNode, which is nice. Here are some logs: > !image-2022-05-19-20-23-23-825.png! > But we have no way of knowing which NameNode the request was made to. We > should log more detailed information, such as the host associated with the > StandbyNameNode. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16584) Record StandbyNameNode information when Balancer is running
[ https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16584: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16584) Record StandbyNameNode information when Balancer is running
JiangHua Zhu created HDFS-16584: --- Summary: Record StandbyNameNode information when Balancer is running Key: HDFS-16584 URL: https://issues.apache.org/jira/browse/HDFS-16584 Project: Hadoop HDFS Issue Type: Improvement Components: balancer mover, namenode Affects Versions: 3.3.0 Reporter: JiangHua Zhu Attachments: image-2022-05-19-20-23-23-825.png
[jira] [Work started] (HDFS-16581) Print DataNode node status
[ https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16581 started by JiangHua Zhu.
[jira] [Resolved] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu resolved HDFS-16565. - Resolution: Duplicate > DataNode holds a large number of CLOSE_WAIT connections that are not released > - > > Key: HDFS-16565 > URL: https://issues.apache.org/jira/browse/HDFS-16565 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, ec >Affects Versions: 3.3.0 > Environment: CentOS Linux release 7.5.1804 (Core) >Reporter: JiangHua Zhu >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > There is a strange phenomenon here, DataNode holds a large number of > connections in CLOSE_WAIT state and does not release. > netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' > LISTEN 20 > CLOSE_WAIT 17707 > ESTABLISHED 1450 > TIME_WAIT 12 > It can be found that the connections with the CLOSE_WAIT state have reached > 17k and are still growing. View these CLOSE_WAITs through the lsof command, > and get the following phenomenon: > lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND' > !screenshot-1.png! > It can be seen that the reason for this phenomenon is that Socket#close() is > not called correctly, and DataNode interacts with other nodes as Client. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
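The netstat/awk one-liner quoted in the description above tallies TCP connection states. For reference, an equivalent sketch in Python, assuming plain `netstat -na` output where the state is the last whitespace-separated field of each `tcp` line:

```python
from collections import Counter

def count_tcp_states(netstat_output):
    # Python equivalent of: netstat -na | awk '/^tcp/ {++S[$NF]} END {...}'
    # Tally the last field (the state) of every line starting with "tcp".
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states
```

Run periodically, this makes the growth of CLOSE_WAIT described in the issue easy to chart.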
[jira] [Assigned] (HDFS-16581) Print DataNode node status
[ https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16581: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16581) Print DataNode node status
JiangHua Zhu created HDFS-16581: --- Summary: Print DataNode node status Key: HDFS-16581 URL: https://issues.apache.org/jira/browse/HDFS-16581 Project: Hadoop HDFS Issue Type: Improvement Components: dfsadmin, namenode Affects Versions: 3.3.0 Reporter: JiangHua Zhu
[jira] [Commented] (HDFS-16576) Remove unused Imports in Hadoop HDFS project
[ https://issues.apache.org/jira/browse/HDFS-16576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535817#comment-17535817 ] JiangHua Zhu commented on HDFS-16576: - It looks like what is described here is somewhat simple. > Remove unused Imports in Hadoop HDFS project > > > Key: HDFS-16576 > URL: https://issues.apache.org/jira/browse/HDFS-16576 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ashutosh Gupta >Assignee: Ashutosh Gupta >Priority: Minor > > h3. Optimize Imports to keep code clean > # Remove any unused imports -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Component/s: ec
[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533457#comment-17533457 ] JiangHua Zhu commented on HDFS-16565: - A problem with socket leaks caused by StripedBlockChecksumReconstructor was found here. Here are some logs from the online cluster: 2022-05-07 13:01:46,798 WARN org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper: Exception while reading checksum java.net.SocketTimeoutException: 3000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.198.108.108:17834 remote=/10.198.109.181:1004] Here is the source code for version 3.3.x, in BlockChecksumHelper: !screenshot-2.png! This issue has been reported in HDFS-15709. 
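As the comments on HDFS-16565 note, connections stuck in CLOSE_WAIT are the signature of a client that received the peer's FIN but never called Socket#close() on its own side. The general fix, independent of the actual BlockChecksumHelper code, is unconditional cleanup on every exit path. A minimal Python sketch of the pattern (the function name and wire usage are hypothetical, for illustration only):

```python
import socket

def read_until_eof(host, port, timeout=3.0):
    # A socket opened with a context manager is closed on every exit path,
    # including timeouts and read errors, so our side always answers the
    # peer's FIN with its own close() and never lingers in CLOSE_WAIT.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # peer closed: EOF
                break
            chunks.append(chunk)
        return b"".join(chunks)
```

In Java the equivalent is try-with-resources (or try/finally) around the socket and its streams, which is what a leak fix of this kind typically adds.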
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Attachment: screenshot-2.png
[jira] [Comment Edited] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532808#comment-17532808 ] JiangHua Zhu edited comment on HDFS-16565 at 5/6/22 11:46 AM: -- Thanks [~hexiaoqiao] for the comment. This phenomenon occurs on all DataNodes in our online cluster. We use hadoop 3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP hadoop-ec482.xxx:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)
Here, hadoop-ec482.xxx is the local DataNode. You can see that a random local port is used when connecting to other nodes, but the connection then remains for a long time and is not released. I suspect the problem is on nodes like hadoop-ec482.xxx, caused by a stream or socket not being closed properly. Our cluster is used in 3 ways: 1. The HDFS Client API is used to store EC data. 2. Data is copied or transferred when a DataNode is forced offline, or when the balancer runs. 3. A small amount of multi-replica data is stored. I'm still investigating the exact cause. [~hexiaoqiao], do you have any suggestions? Thank you very much. was (Author: jianghuazhu): Thanks [~hexiaoqiao] for the comment. This phenomenon occurs on all DataNodes in our online cluster. We use hadoop 3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)
Here, hadoop-ec482.xxx is the local DataNode. You can see that a random local port is used when connecting to other nodes, but the connection then remains for a long time and is not released. I suspect the problem is on nodes like hadoop-ec482.xxx, caused by a stream or socket not being closed properly. Our cluster is used in 3 ways: 1. The HDFS Client API is used to store EC data. 2. Data is copied or transferred when a DataNode is forced offline, or when the balancer runs. 3. A small amount of multi-replica data is stored. I'm still investigating the exact cause. [~hexiaoqiao], do you have any suggestions? Thank you very much.
[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532808#comment-17532808 ] JiangHua Zhu commented on HDFS-16565: - Thanks [~hexiaoqiao] for the comment. This phenomenon occurs on all DataNodes in our online cluster. We use hadoop 3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)
Here, hadoop-ec482.xxx is the local DataNode. You can see that a random local port is used when connecting to other nodes, but the connection then remains for a long time and is not released. I suspect the problem is on nodes like hadoop-ec482.xxx, caused by a stream or socket not being closed properly. Our cluster is used in 3 ways: 1. The HDFS Client API is used to store EC data. 2. Data is copied or transferred when a DataNode is forced offline, or when the balancer runs. 3. A small amount of multi-replica data is stored. I'm still investigating the exact cause. [~hexiaoqiao], do you have any suggestions? Thank you very much.
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Issue Type: Bug (was: Improvement)
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Description: There is a strange phenomenon here: the DataNode holds a large number of connections in the CLOSE_WAIT state and does not release them. netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' LISTEN 20 CLOSE_WAIT 17707 ESTABLISHED 1450 TIME_WAIT 12 The connections in the CLOSE_WAIT state have reached 17k and are still growing. Viewing these CLOSE_WAITs with the lsof command gives the following: lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND' !screenshot-1.png! The reason for this phenomenon is that Socket#close() is not called correctly when the DataNode interacts with other nodes as a client.
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Environment: CentOS Linux release 7.5.1804 (Core)
[jira] [Assigned] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16565: --- Assignee: (was: JiangHua Zhu)
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Attachment: screenshot-1.png > DataNode holds a large number of CLOSE_WAIT connections that are not released > - > > Key: HDFS-16565 > URL: https://issues.apache.org/jira/browse/HDFS-16565 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Attachments: screenshot-1.png > > > When DataTransfer runs, the local node needs to connect to another DataNode, > which is through socket. Once the connection fails, a NoRouteToHostException > will be generated. > Exception information: > 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeRegistration(...:1004, > datanodeUuid=..., infoPort=1006 , infoSecurePort=0, > ipcPort=8025, > storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed > to transfer BP-1375239094-...- > 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got > java.net.NoRouteToHostException: No route to host > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497) > at > org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > The source of the accident: > sock = newSocket(); > NetUtils.connect(sock, curTarget, dnConf.socketTimeout); > sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay()); > 
sock.setSoTimeout(targets.length * dnConf.socketTimeout); > When a NoRouteToHostException occurs, the Block will be added to the > VolumeScanner, and the VolumeScanner will start working to scan the Block. > This should not happen because this is not a real IOException.
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Description: There is a strange phenomenon here, DataNode holds a large number of connections in CLOSE_WAIT state and does not release. netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' LISTEN 20 CLOSE_WAIT 17707 ESTABLISHED 1450 TIME_WAIT 12 It can be found that the connections with the CLOSE_WAIT state have reached 17k and are still growing. View these CLOSE_WAITs through the lsof command, and get the following phenomenon: lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND' !screenshot-1.png! It can be seen that the reason for this phenomenon is that Socket#close() is not called correctly, and DataNode interacts with other nodes as Client. was: When DataTransfer runs, the local node needs to connect to another DataNode, which is through socket. Once the connection fails, a NoRouteToHostException will be generated. Exception information: 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(...:1004, datanodeUuid=..., infoPort=1006 , infoSecurePort=0, ipcPort=8025, storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed to transfer BP-1375239094-...- 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at 
java.lang.Thread.run(Thread.java:748) The source of the accident: sock = newSocket(); NetUtils.connect(sock, curTarget, dnConf.socketTimeout); sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay()); sock.setSoTimeout(targets.length * dnConf.socketTimeout); When a NoRouteToHostException occurs, the Block will be added to the VolumeScanner, and the VolumeScanner will start working to scan the Block. This should not happen because this is not a real IOException.
[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Summary: DataNode holds a large number of CLOSE_WAIT connections that are not released (was: Optimize DataNode#DataTransfer, when encountering NoRouteToHostException)
[jira] [Updated] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16565: Description: When DataTransfer runs, the local node needs to connect to another DataNode, which is through socket. Once the connection fails, a NoRouteToHostException will be generated. Exception information: 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(...:1004, datanodeUuid=..., infoPort=1006 , infoSecurePort=0, ipcPort=8025, storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed to transfer BP-1375239094-...- 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The source of the accident: sock = newSocket(); NetUtils.connect(sock, curTarget, dnConf.socketTimeout); sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay()); sock.setSoTimeout(targets.length * dnConf.socketTimeout); When a NoRouteToHostException occurs, the Block will be added to the VolumeScanner, and the VolumeScanner will start working to scan the Block. This should not happen because this is not a real IOException. was: When DataTransfer runs, the local node needs to connect to another DataNode, which is through socket. Once the connection fails, a NoRouteToHostException will be generated. 
Exception information: 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(...:1004, datanodeUuid=..., infoPort=1006 , infoSecurePort=0, ipcPort=8025, storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed to transfer BP-1375239094-...- 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The source of the accident: sock = newSocket(); NetUtils.connect(sock, curTarget, dnConf.socketTimeout); sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay()); sock.setSoTimeout(targets.length * dnConf.socketTimeout); When a NoRouteToHostException occurs, the Block will be added to the VolumeScanner, and the VolumeScanner will start working to scan the Block. This should not happen because this is not a real IOException. 
catch (IOException ie) { handleBadBlock(b, ie, false); LOG.warn("{}:Failed to transfer {} to {} got", bpReg, b, targets[0], ie); }
[jira] [Assigned] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException
[ https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16565: --- Assignee: JiangHua Zhu
[jira] [Created] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException
JiangHua Zhu created HDFS-16565: --- Summary: Optimize DataNode#DataTransfer, when encountering NoRouteToHostException Key: HDFS-16565 URL: https://issues.apache.org/jira/browse/HDFS-16565 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 3.3.0 Reporter: JiangHua Zhu When DataTransfer runs, the local node needs to connect to another DataNode, which is through socket. Once the connection fails, a NoRouteToHostException will be generated. Exception information: 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(...:1004, datanodeUuid=..., infoPort=1006 , infoSecurePort=0, ipcPort=8025, storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed to transfer BP-1375239094-...- 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got java.net.NoRouteToHostException: No route to host at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497) at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The source of the accident: sock = newSocket(); NetUtils.connect(sock, curTarget, dnConf.socketTimeout); sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay()); sock.setSoTimeout(targets.length * dnConf.socketTimeout); When a NoRouteToHostException occurs, the Block will be added to the VolumeScanner, and the VolumeScanner will start working to scan the Block. This should not happen because this is not a real IOException. 
catch (IOException ie) {
  handleBadBlock(b, ie, false);
  LOG.warn("{}:Failed to transfer {} to {} got", bpReg, b, targets[0], ie);
}
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
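The proposal above amounts to classifying NoRouteToHostException as a network fault rather than evidence of a bad replica. A minimal standalone sketch of that classification, assuming nothing beyond the JDK (the isNetworkFault helper is hypothetical, not Hadoop API):

```java
import java.io.IOException;
import java.net.NoRouteToHostException;

// Hypothetical sketch: separate network faults from suspected bad replicas
// before deciding whether to hand a block to the VolumeScanner.
public class TransferErrorHandling {

    static boolean isNetworkFault(IOException ie) {
        // "No route to host" means the remote peer was unreachable; the
        // local replica is not implicated and need not be rescanned.
        return ie instanceof NoRouteToHostException;
    }

    public static void main(String[] args) {
        IOException netErr = new NoRouteToHostException("No route to host");
        IOException ioErr = new IOException("checksum error");
        System.out.println(isNetworkFault(netErr)); // network fault: skip handleBadBlock
        System.out.println(isNetworkFault(ioErr));  // possible bad block: existing path
    }
}
```

In the real catch block, the check would simply guard the handleBadBlock(b, ie, false) call.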
[jira] [Commented] (HDFS-16498) Fix NPE for checkBlockReportLease
[ https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17503966#comment-17503966 ] JiangHua Zhu commented on HDFS-16498: - This seems to be a robustness issue with the NameNode. A normal BlockReport workflow: 1. The DataNode registers itself with the NameNode. 2. The DataNode sends a report request to the NameNode. 3. The NameNode processes the BlockReport. It looks like this happens when the NameNode and DataNode restart at the same time. When this happens, it may be more appropriate for the log level to be WARN or INFO. This is just my thought. !screenshot-1.png! > Fix NPE for checkBlockReportLease > - > > Key: HDFS-16498 > URL: https://issues.apache.org/jira/browse/HDFS-16498 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > During the restart of the Namenode, a Datanode is not registered, but this > Datanode triggers FBR, which causes an NPE. > !image-2022-03-09-20-35-22-028.png|width=871,height=158!
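The NPE arises because a full block report (FBR) from a not-yet-registered DataNode reaches the lease check, which dereferences a node entry that does not exist. A minimal standalone sketch of the fix direction, assuming illustrative names (BlockReportLeaseCheck, NodeInfo, checkLease are hypothetical, not the actual Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: reject an FBR from an unregistered DataNode with a
// log message instead of dereferencing a null node entry and throwing an NPE.
public class BlockReportLeaseCheck {

    static class NodeInfo {
        final long leaseId;
        NodeInfo(long leaseId) { this.leaseId = leaseId; }
    }

    private final Map<String, NodeInfo> registered = new HashMap<>();

    void register(String datanodeUuid, long leaseId) {
        registered.put(datanodeUuid, new NodeInfo(leaseId));
    }

    boolean checkLease(String datanodeUuid, long leaseId) {
        NodeInfo node = registered.get(datanodeUuid);
        if (node == null) {
            // Unregistered DataNode: log at WARN/INFO and have it
            // re-register, rather than letting an NPE propagate.
            return false;
        }
        return node.leaseId == leaseId;
    }
}
```

The null check is the whole fix; the log-level question discussed above only concerns how loudly this expected restart race is reported.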
[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease
[ https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu updated HDFS-16498: Attachment: screenshot-1.png > Fix NPE for checkBlockReportLease > - > > Key: HDFS-16498 > URL: https://issues.apache.org/jira/browse/HDFS-16498 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: tomscut >Assignee: tomscut >Priority: Major > Labels: pull-request-available > Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > During the restart of Namenode, a Datanode is not registered, but this > Datanode triggers FBR, which causes NPE. > !image-2022-03-09-20-35-22-028.png|width=871,height=158!
[jira] [Work started] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
[ https://issues.apache.org/jira/browse/HDFS-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16494 started by JiangHua Zhu. --- > Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks() > --- > > Key: HDFS-16494 > URL: https://issues.apache.org/jira/browse/HDFS-16494 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.9.2, 3.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When building the AvailableSpaceVolumeChoosingPolicy, if the default > constructor is used, initLocks() will be used twice, which is actually > unnecessary.
[jira] [Created] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
JiangHua Zhu created HDFS-16494: --- Summary: Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks() Key: HDFS-16494 URL: https://issues.apache.org/jira/browse/HDFS-16494 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 2.9.2, 3.4.0 Reporter: JiangHua Zhu When an AvailableSpaceVolumeChoosingPolicy is built through the default constructor, initLocks() is called twice, which is unnecessary.
[jira] [Assigned] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
[ https://issues.apache.org/jira/browse/HDFS-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JiangHua Zhu reassigned HDFS-16494: --- Assignee: JiangHua Zhu > Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks() > --- > > Key: HDFS-16494 > URL: https://issues.apache.org/jira/browse/HDFS-16494 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 2.9.2, 3.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > > When building the AvailableSpaceVolumeChoosingPolicy, if the default > constructor is used, initLocks() will be used twice, which is actually > unnecessary.
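The double call described in HDFS-16494 is the common pattern where a default constructor both chains to an overloaded constructor and repeats the init helper. A minimal standalone sketch of the fix, assuming illustrative names (ChoosingPolicy and the counter are for demonstration only, not the actual Hadoop class):

```java
import java.util.Random;

// Hypothetical sketch: the default constructor delegates via this(...), and
// initLocks() is invoked at exactly one point, so it runs once per instance.
public class ChoosingPolicy {
    static int initLockCalls = 0; // instrumentation for this example only

    private final Random random;

    public ChoosingPolicy() {
        this(new Random()); // delegate; do NOT call initLocks() again here
    }

    public ChoosingPolicy(Random random) {
        this.random = random;
        initLocks(); // the single initialization point
    }

    private void initLocks() {
        initLockCalls++;
    }
}
```

Keeping initialization in one constructor and delegating from the others removes the redundant work without changing behavior.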
[jira] [Work started] (HDFS-16476) Increase the number of metrics used to record PendingRecoveryBlocks
[ https://issues.apache.org/jira/browse/HDFS-16476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16476 started by JiangHua Zhu. --- > Increase the number of metrics used to record PendingRecoveryBlocks > --- > > Key: HDFS-16476 > URL: https://issues.apache.org/jira/browse/HDFS-16476 > Project: Hadoop HDFS > Issue Type: Improvement > Components: metrics, namenode >Affects Versions: 2.9.2, 3.4.0 >Reporter: JiangHua Zhu >Assignee: JiangHua Zhu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The complete process of block recovery is as follows: > 1. The NameNode collects which blocks need to be recovered. > 2. The NameNode issues instructions to some DataNodes for execution. > 3. The DataNode tells the NameNode after execution is complete. > There is currently no way to know how many blocks are being recovered. The number > of metrics used to record PendingRecoveryBlocks should be increased, which is > good for increasing the robustness of the cluster. > Here are some logs of DataNode execution: > 2022-02-10 23:51:04,386 [12208592621] - INFO [IPC Server handler 38 on > 8025:FsDatasetImpl@2687] - initReplicaRecovery: changing replica state for > blk_ from RBW to RUR > 2022-02-10 23:51:04,395 [12208592630] - INFO [IPC Server handler 47 on > 8025:FsDatasetImpl@2708] - updateReplica: BP-:blk_, > recoveryId=18386356475, length=129869866, replica=ReplicaUnderRecovery, > blk_, RUR > Here are some logs that the NameNode receives after completion: > 2022-02-22 10:43:58,780 [8193058814] - INFO [IPC Server handler 15 on > 8021:FSNamesystem@3647] - commitBlockSynchronization(oldBlock=BP-, > newgenerationstamp=18551926574, newlength=16929, newtargets=[1:1004, > 2:1004, 3:1004]) successful
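The three-step recovery flow above maps naturally onto a gauge: increment when the NameNode schedules a recovery (step 2), decrement when commitBlockSynchronization succeeds (step 3). A standalone sketch under those assumptions (class and method names are hypothetical, not the HDFS Metrics2 API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a PendingRecoveryBlocks gauge: tracks how many
// block recoveries the NameNode has issued that have not yet completed.
public class PendingRecoveryBlocksMetric {
    private final AtomicLong pending = new AtomicLong();

    // Called when the NameNode issues a recovery instruction to a DataNode.
    void recoveryScheduled() { pending.incrementAndGet(); }

    // Called when commitBlockSynchronization reports success.
    void recoveryCompleted() { pending.decrementAndGet(); }

    long getPendingRecoveryBlocks() { return pending.get(); }
}
```

Exposing the gauge value through the NameNode's existing metrics system would then let operators watch recovery backlog directly.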