[jira] [Updated] (HDFS-17228) Add documentation related to BlockManager

2023-10-17 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-17228:

Component/s: documentation

> Add documentation related to BlockManager
> -
>
> Key: HDFS-17228
> URL: https://issues.apache.org/jira/browse/HDFS-17228
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: block placement, documentation
>Affects Versions: 3.3.3, 3.3.6
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
> Attachments: image-2023-10-17-17-25-27-363.png
>
>
> In the BlockManager class, some important comments are missing.
> This happens here:
>  !image-2023-10-17-17-25-27-363.png! 
> Improving these comments would make the code easier to understand and maintain.






[jira] [Created] (HDFS-17228) Add documentation related to BlockManager

2023-10-17 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-17228:
---

 Summary: Add documentation related to BlockManager
 Key: HDFS-17228
 URL: https://issues.apache.org/jira/browse/HDFS-17228
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: block placement
Affects Versions: 3.3.6, 3.3.3
Reporter: JiangHua Zhu
 Attachments: image-2023-10-17-17-25-27-363.png

In the BlockManager class, some important comments are missing.
This happens here:
 !image-2023-10-17-17-25-27-363.png! 

Improving these comments would make the code easier to understand and maintain.
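As an illustration only, the kind of comment that could be added might look 
like the sketch below; the method shown is hypothetical, not taken from 
BlockManager:
{code:java}
// Hypothetical example of the missing documentation, for illustration only.
/**
 * Queue a block for replication once it is found to be under-replicated.
 * Callers must hold the namesystem write lock.
 *
 * @param block the under-replicated block
 */
void exampleQueueForReplication(Block block) {
  // ...
}
{code}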






[jira] [Assigned] (HDFS-17228) Add documentation related to BlockManager

2023-10-17 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-17228:
---

Assignee: JiangHua Zhu

> Add documentation related to BlockManager
> -
>
> Key: HDFS-17228
> URL: https://issues.apache.org/jira/browse/HDFS-17228
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: block placement
>Affects Versions: 3.3.3, 3.3.6
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
> Attachments: image-2023-10-17-17-25-27-363.png
>
>
> In the BlockManager class, some important comments are missing.
> This happens here:
>  !image-2023-10-17-17-25-27-363.png! 
> Improving these comments would make the code easier to understand and maintain.






[jira] [Updated] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT

2023-05-15 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-17012:

Attachment: screenshot-1.png

> Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
> 
>
> Key: HDFS-17012
> URL: https://issues.apache.org/jira/browse/HDFS-17012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to be 
> used anywhere; it is a redundant constant and we should remove it.






[jira] [Updated] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT

2023-05-15 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-17012:

Description: 
In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to be 
used anywhere; it is a redundant constant and we should remove it.
 !screenshot-1.png! 

  was:In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to 
be used anywhere; it is a redundant constant and we should remove it.


> Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
> 
>
> Key: HDFS-17012
> URL: https://issues.apache.org/jira/browse/HDFS-17012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
> Attachments: screenshot-1.png
>
>
> In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to be 
> used anywhere; it is a redundant constant and we should remove it.
>  !screenshot-1.png! 






[jira] [Created] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT

2023-05-15 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-17012:
---

 Summary: Remove unused 
DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
 Key: HDFS-17012
 URL: https://issues.apache.org/jira/browse/HDFS-17012
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, hdfs
Affects Versions: 3.3.4
Reporter: JiangHua Zhu


In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to be 
used anywhere; it is a redundant constant and we should remove it.
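For context, a minimal sketch of how this constant pair sits in DFSConfigKeys 
(the key name and the empty default below are assumptions for illustration):
{code:java}
// Assumed layout, for illustration only:
public static final String DFS_DATANODE_PMEM_CACHE_DIRS_KEY =
    "dfs.datanode.cache.pmem.dirs";
// The DEFAULT constant is referenced nowhere, so deleting it should be safe.
public static final String DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT = "";
{code}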






[jira] [Assigned] (HDFS-17012) Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT

2023-05-15 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-17012:
---

Assignee: JiangHua Zhu

> Remove unused DFSConfigKeys#DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT
> 
>
> Key: HDFS-17012
> URL: https://issues.apache.org/jira/browse/HDFS-17012
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>
> In DFSConfigKeys, DFS_DATANODE_PMEM_CACHE_DIRS_DEFAULT does not appear to be 
> used anywhere; it is a redundant constant and we should remove it.






[jira] [Commented] (HDFS-16863) Optimize frequency of regular block reports

2023-01-06 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655400#comment-17655400
 ] 

JiangHua Zhu commented on HDFS-16863:
-

[~yuyanlei], if the FBR frequency is reduced, will it have any new impact on:
1. Stale replicas on DataNodes: the NameNode should be notified but may not be 
notified in time.
2. The completeness of the replica data maintained by the NameNode.

> Optimize frequency of regular block reports
> ---
>
> Key: HDFS-16863
> URL: https://issues.apache.org/jira/browse/HDFS-16863
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yanlei Yu
>Priority: Major
> Attachments: HDFS-16863.patch
>
>
> Like HDFS-15162:
> Avoid sending a block report at the regular interval if there has been no 
> failover, DiskError, or exception encountered in connecting to the NameNode.
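For reference, a minimal sketch of where the regular FBR interval comes from 
(these DFSConfigKeys names are the standard ones; shown for illustration only):
{code:java}
// Read the full block report interval; the default is 6 hours (21600000 ms).
long brIntervalMs = conf.getLong(
    DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY,
    DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_DEFAULT);
{code}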






[jira] [Updated] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface

2022-10-20 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16807:

Affects Version/s: 2.9.2

> Improve legacy ClientProtocol#rename2() interface
> -
>
> Key: HDFS-16807
> URL: https://issues.apache.org/jira/browse/HDFS-16807
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Affects Versions: 2.9.2, 3.3.3
>Reporter: JiangHua Zhu
>Priority: Major
>
> In HDFS-2298, rename2() replaced rename(), which was a meaningful 
> improvement. However, some legacy usages are still preserved:
> 1. When the shell executes the mv command, rename() is still used:
> ./bin/hdfs dfs -mv [source] [target]
> {code:java}
> In MoveCommands#Rename:
> protected void processPath(PathData src, PathData target) throws 
> IOException {
>   ..
>   if (!target.fs.rename(src.path, target.path)) {
> // we have no way to know the actual error...
> throw new PathIOException(src.toString());
>   }
> }
> {code}
> 2. When NNThroughputBenchmark benchmarks rename.
> In NNThroughputBenchmark#RenameFileStats:
> {code:java}
> long executeOp(int daemonId, int inputIdx, String ignore)
> throws IOException {
>   long start = Time.now();
>   clientProto.rename(fileNames[daemonId][inputIdx],
>   destNames[daemonId][inputIdx]);
>   long end = Time.now();
>   return end-start;
> }
> {code}
> Since rename() is deprecated, I think the interface usage should be kept 
> uniform. For NNThroughputBenchmark this is easy, but improving MoveCommands 
> is not, because it involves changes to the FileSystem API.






[jira] [Commented] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface

2022-10-19 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17620206#comment-17620206
 ] 

JiangHua Zhu commented on HDFS-16807:
-

Can you guys post some suggestions? [~weichiu] [~aajisaka] [~hexiaoqiao] 
[~steve_l] [~ayushtkn].
Any suggestion is fine.


> Improve legacy ClientProtocol#rename2() interface
> -
>
> Key: HDFS-16807
> URL: https://issues.apache.org/jira/browse/HDFS-16807
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Priority: Major
>
> In HDFS-2298, rename2() replaced rename(), which was a meaningful 
> improvement. However, some legacy usages are still preserved:
> 1. When the shell executes the mv command, rename() is still used:
> ./bin/hdfs dfs -mv [source] [target]
> {code:java}
> In MoveCommands#Rename:
> protected void processPath(PathData src, PathData target) throws 
> IOException {
>   ..
>   if (!target.fs.rename(src.path, target.path)) {
> // we have no way to know the actual error...
> throw new PathIOException(src.toString());
>   }
> }
> {code}
> 2. When NNThroughputBenchmark benchmarks rename.
> In NNThroughputBenchmark#RenameFileStats:
> {code:java}
> long executeOp(int daemonId, int inputIdx, String ignore)
> throws IOException {
>   long start = Time.now();
>   clientProto.rename(fileNames[daemonId][inputIdx],
>   destNames[daemonId][inputIdx]);
>   long end = Time.now();
>   return end-start;
> }
> {code}
> Since rename() is deprecated, I think the interface usage should be kept 
> uniform. For NNThroughputBenchmark this is easy, but improving MoveCommands 
> is not, because it involves changes to the FileSystem API.






[jira] [Created] (HDFS-16807) Improve legacy ClientProtocol#rename2() interface

2022-10-19 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16807:
---

 Summary: Improve legacy ClientProtocol#rename2() interface
 Key: HDFS-16807
 URL: https://issues.apache.org/jira/browse/HDFS-16807
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsclient
Affects Versions: 3.3.3
Reporter: JiangHua Zhu


In HDFS-2298, rename2() replaced rename(), which was a meaningful improvement. 
However, some legacy usages are still preserved:
1. When the shell executes the mv command, rename() is still used:
./bin/hdfs dfs -mv [source] [target]
{code:java}
In MoveCommands#Rename:
protected void processPath(PathData src, PathData target) throws 
IOException {
  ..
  if (!target.fs.rename(src.path, target.path)) {
// we have no way to know the actual error...
throw new PathIOException(src.toString());
  }
}
{code}

2. When NNThroughputBenchmark benchmarks rename.
In NNThroughputBenchmark#RenameFileStats:
{code:java}
long executeOp(int daemonId, int inputIdx, String ignore)
throws IOException {
  long start = Time.now();
  clientProto.rename(fileNames[daemonId][inputIdx],
  destNames[daemonId][inputIdx]);
  long end = Time.now();
  return end-start;
}
{code}

Since rename() is deprecated, I think the interface usage should be kept 
uniform. For NNThroughputBenchmark this is easy, but improving MoveCommands is 
not, because it involves changes to the FileSystem API.
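For the NNThroughputBenchmark side, a minimal sketch of what the switch could 
look like (an assumption, not a committed patch):
{code:java}
long executeOp(int daemonId, int inputIdx, String ignore)
    throws IOException {
  long start = Time.now();
  // rename2() takes Options.Rename varargs; passing none requests a plain,
  // non-overwriting rename.
  clientProto.rename2(fileNames[daemonId][inputIdx],
      destNames[daemonId][inputIdx]);
  long end = Time.now();
  return end - start;
}
{code}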






[jira] [Commented] (HDFS-14750) RBF: Improved isolation for downstream name nodes. {Dynamic}

2022-10-18 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619276#comment-17619276
 ] 

JiangHua Zhu commented on HDFS-14750:
-

Thanks [~xuzq_zander] for the work.
I have read your design and have some doubts:
1. Will the penalty time incurred while the Router is running really decrease 
significantly?
2. Will it affect existing features such as Quota?
3. Overall, RBF still has a lot of room for development, and compatibility 
needs to be considered.
I also have some ideas of my own to add, if I may:
1. Add additional functions to each Router, including:
   1.1. Collecting the Router's own processing performance metrics, similar to 
a sliding window.
   1.2. Dynamically setting the maximum allowed processing limit based on the 
sliding-window value (a rough sketch follows this list).
2. Isolate the exception handling from the namespace.
With these measures, the current problems can be effectively alleviated while 
good compatibility is maintained.
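To make idea 1.1 concrete, here is a tiny illustrative sliding-window counter 
(all names are hypothetical; this is not taken from the design doc):
{code:java}
// Fixed-bucket sliding window: record() counts an event in the current
// bucket; total() sums the last N buckets, expiring old ones as time passes.
class SlidingWindowCounter {
  private final long[] buckets;
  private final long bucketMillis;
  private long lastTick = System.currentTimeMillis();
  private int current;

  SlidingWindowCounter(int numBuckets, long bucketMillis) {
    this.buckets = new long[numBuckets];
    this.bucketMillis = bucketMillis;
  }

  synchronized void record() {
    advance();
    buckets[current]++;
  }

  synchronized long total() {
    advance();
    long sum = 0;
    for (long b : buckets) {
      sum += b;
    }
    return sum;
  }

  private void advance() {
    long now = System.currentTimeMillis();
    while (now - lastTick >= bucketMillis) {
      current = (current + 1) % buckets.length;
      buckets[current] = 0; // expire the oldest bucket
      lastTick += bucketMillis;
    }
  }
}
{code}
A Router could compare total() against a dynamically chosen upper limit and 
reject or queue requests once the limit is exceeded.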


> RBF: Improved isolation for downstream name nodes. {Dynamic}
> 
>
> Key: HDFS-14750
> URL: https://issues.apache.org/jira/browse/HDFS-14750
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 4h
>  Remaining Estimate: 0h
>
> This Jira tracks the work around dynamic allocation of resources in routers 
> for downstream hdfs clusters. 






[jira] [Updated] (HDFS-16803) Improve some annotations in hdfs module

2022-10-16 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16803:

Description: 
In the hdfs module, some annotations are out of date. E.g:
{code:java}
  FSDirRenameOp: 
  /**
   * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath,
   * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
   */
  static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
  String src, String dst, BlocksMapUpdateInfo collectedBlocks,
  boolean logRetryCache,Options.Rename... options)
  throws IOException {
{code}

We should improve these annotations so that the documentation reads better.

  was:
In FSDirRenameOp, some annotations are out of date. E.g:
{code:java}
  /**
   * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath,
   * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
   */
  static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
  String src, String dst, BlocksMapUpdateInfo collectedBlocks,
  boolean logRetryCache,Options.Rename... options)
  throws IOException {
{code}

We should improve these annotations so that the documentation reads better.


> Improve some annotations in hdfs module
> ---
>
> Key: HDFS-16803
> URL: https://issues.apache.org/jira/browse/HDFS-16803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, namenode
>Affects Versions: 2.9.2, 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>
> In the hdfs module, some annotations are out of date. E.g:
> {code:java}
>   FSDirRenameOp: 
>   /**
>* @see {@link #unprotectedRenameTo(FSDirectory, String, String, 
> INodesInPath,
>* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
>*/
>   static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
>   String src, String dst, BlocksMapUpdateInfo collectedBlocks,
>   boolean logRetryCache,Options.Rename... options)
>   throws IOException {
> {code}
> We should improve these annotations so that the documentation reads better.






[jira] [Commented] (HDFS-16805) Support set the number of RPC Readers according to different RPC Servers in NameNode

2022-10-16 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618653#comment-17618653
 ] 

JiangHua Zhu commented on HDFS-16805:
-

[~haiyang Hu], here is a similar jira: HDFS-16107


> Support set the number of RPC Readers according to different RPC Servers in 
> NameNode
> 
>
> Key: HDFS-16805
> URL: https://issues.apache.org/jira/browse/HDFS-16805
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>
> Currently, multiple RPC servers are started in the NameNode, such as the 
> client RPC server, service RPC server, and lifeline RPC server, and each RPC 
> server uses the same parameter 'ipc.server.read.threadpool.size' to set its 
> number of reader threads.
> Consider setting the number of reader threads per RPC server according to its 
> requirements, for example (a sketch follows this list):
> In the client RPC server, use parameter 'dfs.namenode.reader.count';
> In the service RPC server, use parameter 'dfs.namenode.service.reader.count';
> In the lifeline RPC server, use parameter 'dfs.namenode.lifeline.reader.count'.
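A minimal sketch of the proposed lookup (falling back to the shared setting is 
an assumption, not part of the description above):
{code:java}
// Each RPC server resolves its own reader count, defaulting to the shared key.
int sharedReaders = conf.getInt("ipc.server.read.threadpool.size", 1);
int clientReaders = conf.getInt("dfs.namenode.reader.count", sharedReaders);
int serviceReaders = conf.getInt("dfs.namenode.service.reader.count",
    sharedReaders);
int lifelineReaders = conf.getInt("dfs.namenode.lifeline.reader.count",
    sharedReaders);
{code}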






[jira] [Updated] (HDFS-16803) Improve some annotations in hdfs module

2022-10-14 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16803:

Summary: Improve some annotations in hdfs module  (was: Improve some 
annotations in FSDirRenameOp)

> Improve some annotations in hdfs module
> ---
>
> Key: HDFS-16803
> URL: https://issues.apache.org/jira/browse/HDFS-16803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, namenode
>Affects Versions: 2.9.2, 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>
> In FSDirRenameOp, some annotations are out of date. E.g:
> {code:java}
>   /**
>* @see {@link #unprotectedRenameTo(FSDirectory, String, String, 
> INodesInPath,
>* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
>*/
>   static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
>   String src, String dst, BlocksMapUpdateInfo collectedBlocks,
>   boolean logRetryCache,Options.Rename... options)
>   throws IOException {
> {code}
> We should improve these annotations so that the documentation reads better.






[jira] [Created] (HDFS-16803) Improve some annotations in FSDirRenameOp

2022-10-14 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16803:
---

 Summary: Improve some annotations in FSDirRenameOp
 Key: HDFS-16803
 URL: https://issues.apache.org/jira/browse/HDFS-16803
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation, namenode
Affects Versions: 3.3.4, 2.9.2
Reporter: JiangHua Zhu


In FSDirRenameOp, some annotations are out of date. E.g:
{code:java}
  /**
   * @see {@link #unprotectedRenameTo(FSDirectory, String, String, INodesInPath,
   * INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
   */
  static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
  String src, String dst, BlocksMapUpdateInfo collectedBlocks,
  boolean logRetryCache,Options.Rename... options)
  throws IOException {
{code}

We should improve these annotations so that the documentation reads better.
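As an illustration of the fix direction (a sketch, not necessarily the 
committed change): the Javadoc @see tag takes a plain reference rather than a 
nested {@link}, so the comment could become:
{code:java}
  /**
   * @see #unprotectedRenameTo(FSDirectory, String, String, INodesInPath,
   *      INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)
   */
  static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
      String src, String dst, BlocksMapUpdateInfo collectedBlocks,
      boolean logRetryCache, Options.Rename... options)
      throws IOException {
{code}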






[jira] [Assigned] (HDFS-16803) Improve some annotations in FSDirRenameOp

2022-10-14 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16803:
---

Assignee: JiangHua Zhu

> Improve some annotations in FSDirRenameOp
> -
>
> Key: HDFS-16803
> URL: https://issues.apache.org/jira/browse/HDFS-16803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, namenode
>Affects Versions: 2.9.2, 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> In FSDirRenameOp, some annotations are out of date. E.g:
> {code:java}
>   /**
>* @see {@link #unprotectedRenameTo(FSDirectory, String, String, 
> INodesInPath,
>* INodesInPath, long, BlocksMapUpdateInfo, Options.Rename...)}
>*/
>   static RenameResult renameTo(FSDirectory fsd, FSPermissionChecker pc,
>   String src, String dst, BlocksMapUpdateInfo collectedBlocks,
>   boolean logRetryCache,Options.Rename... options)
>   throws IOException {
> {code}
> We should improve these annotations so that the documentation reads better.






[jira] [Commented] (HDFS-16802) Print options when accessing ClientProtocol#rename2()

2022-10-13 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17616873#comment-17616873
 ] 

JiangHua Zhu commented on HDFS-16802:
-

New log format:

{code:java}
2022-10-13 12:11:38,813 [Listener at localhost/58086] DEBUG hdfs.StateChange 
(FSDirRenameOp.java:renameToInt(256)) - DIR* NameSystem.renameTo: with options 
- /testNamenodeRetryCache/testRename2/src to /testNamenodeRetryCache 
/testRename2/target, options=[NONE]
{code}


> Print options when accessing ClientProtocol#rename2()
> -
>
> Key: HDFS-16802
> URL: https://issues.apache.org/jira/browse/HDFS-16802
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>
> When accessing ClientProtocol#rename2(), the carried options cannot be seen 
> in the log. Here is some log information:
> {code:java}
> 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG  hdfs.StateChange 
> (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with 
> options - /testNamenodeRetryCache/testRename2/src to 
> /testNamenodeRetryCache/testRename2/target
> {code}
> We should improve this; printing the options would be better.






[jira] [Assigned] (HDFS-16802) Print options when accessing ClientProtocol#rename2()

2022-10-13 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16802:
---

Assignee: JiangHua Zhu

> Print options when accessing ClientProtocol#rename2()
> -
>
> Key: HDFS-16802
> URL: https://issues.apache.org/jira/browse/HDFS-16802
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>
> When accessing ClientProtocol#rename2(), the carried options cannot be seen 
> in the log. Here is some log information:
> {code:java}
> 2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG  hdfs.StateChange 
> (FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with 
> options - /testNamenodeRetryCache/testRename2/src to 
> /testNamenodeRetryCache/testRename2/target
> {code}
> We should improve this; printing the options would be better.






[jira] [Created] (HDFS-16802) Print options when accessing ClientProtocol#rename2()

2022-10-13 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16802:
---

 Summary: Print options when accessing ClientProtocol#rename2()
 Key: HDFS-16802
 URL: https://issues.apache.org/jira/browse/HDFS-16802
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.4
Reporter: JiangHua Zhu


When accessing ClientProtocol#rename2(), the carried options cannot be seen in 
the log. Here is some log information:
{code:java}
2022-10-13 10:21:10,727 [Listener at localhost/59732] DEBUG  hdfs.StateChange 
(FSDirRenameOp.java:renameToInt(255)) - DIR* NameSystem.renameTo: with options 
- /testNamenodeRetryCache/testRename2/src to 
/testNamenodeRetryCache/testRename2/target
{code}

We should improve this; printing the options would be better.
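A minimal sketch of the change in FSDirRenameOp#renameToInt (the exact 
statement is an assumption; its output matches the new log format shown in the 
comment above):
{code:java}
// java.util.Arrays renders the varargs array as e.g. "[NONE]".
NameNode.stateChangeLog.debug("DIR* NameSystem.renameTo: with options - "
    + src + " to " + dst + ", options=" + Arrays.toString(options));
{code}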






[jira] [Updated] (HDFS-16733) Improve INode#isRoot()

2022-08-18 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16733:

Description: 
When constructing an INodeFile or INodeDirectory, a name usually has to be 
given. There are few restrictions on the name backing getLocalNameBytes; for 
example, it can be set to null. But then an exception is thrown:
{code:java}
INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, 
null, perm, 0L);
{code}

Some exceptions:
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78)

{code}

Although these situations rarely occur in production environments, we should 
refine the implementation of isRoot() to avoid this exception. This can enhance 
system robustness.


  was:
When constructing an INodeFile or INodeDirectory, a name usually has to be 
given. There are few restrictions on the name backing getLocalNameBytes; for 
example, it can be set to null. But then an exception is thrown:
{code:java}
INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, 
null, perm, 0L);
{code}

Some exceptions:
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78)
at 
org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274)
{code}

Although these situations rarely occur in production environments, we should 
refine the implementation of isRoot() to avoid this exception. This can enhance 
system robustness.



> Improve INode#isRoot()
> --
>
> Key: HDFS-16733
> URL: https://issues.apache.org/jira/browse/HDFS-16733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>
> When constructing an INodeFile or INodeDirectory, a name usually has to be 
> given. There are few restrictions on the name backing getLocalNameBytes; for 
> example, it can be set to null. But then an exception is thrown:
> {code:java}
> INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, 
> null, perm, 0L);
> {code}
> Some exceptions:
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78)
> {code}
> Although these situations rarely occur in production environments, we should 
> refine the implementation of isRoot() to avoid this exception. This can 
> enhance system robustness.






[jira] [Assigned] (HDFS-16733) Improve INode#isRoot()

2022-08-18 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16733:
---

Assignee: JiangHua Zhu

> Improve INode#isRoot()
> --
>
> Key: HDFS-16733
> URL: https://issues.apache.org/jira/browse/HDFS-16733
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> When constructing an INodeFile or INodeDirectory, a name usually has to be 
> given. There are few restrictions on the name backing getLocalNameBytes; for 
> example, it can be set to null. But then an exception is thrown:
> {code:java}
> INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, 
> null, perm, 0L);
> {code}
> Some exceptions:
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274)
> {code}
> Although these situations rarely occur in production environments, we should 
> refine the implementation of isRoot() to avoid this exception. This can 
> enhance system robustness.






[jira] [Created] (HDFS-16733) Improve INode#isRoot()

2022-08-18 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16733:
---

 Summary: Improve INode#isRoot()
 Key: HDFS-16733
 URL: https://issues.apache.org/jira/browse/HDFS-16733
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu


When constructing an INodeFile or INodeDirectory, a name usually has to be 
given. There are few restrictions on the name backing getLocalNameBytes; for 
example, it can be set to null. But then an exception is thrown:
{code:java}
INodeDirectory root = new INodeDirectory(HdfsConstants.GRANDFATHER_INODE_ID, 
null, perm, 0L);
{code}

Some exceptions:
{code:java}
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.INode.isRoot(INode.java:78)
at 
org.apache.hadoop.hdfs.server.namenode.TestINodeFile.testIsRoot(TestINodeFile.java:1274)
{code}

Although these situations rarely occur in production environments, we should 
refine the implementation of isRoot() to avoid this exception. This can enhance 
system robustness.
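One possible null-safe refinement (a sketch; the committed change may differ):
{code:java}
// Root is the only INode whose local name is empty; treat a missing (null)
// name as "not root" instead of throwing NullPointerException.
boolean isRoot() {
  final byte[] name = getLocalNameBytes();
  return name != null && name.length == 0;
}
{code}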







[jira] [Updated] (HDFS-16729) RBF: fix some unreasonably annotated docs

2022-08-16 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16729:

Component/s: documentation

> RBF: fix some unreasonably annotated docs
> -
>
> Key: HDFS-16729
> URL: https://issues.apache.org/jira/browse/HDFS-16729
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, rbf
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-08-16-14-19-07-630.png
>
>
> I found some unreasonable annotations in the documentation here. E.g:
>  !image-2022-08-16-14-19-07-630.png! 
> We should make these annotations cleaner.






[jira] [Assigned] (HDFS-16729) RBF: fix some unreasonably annotated docs

2022-08-16 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16729:
---

Assignee: JiangHua Zhu

> RBF: fix some unreasonably annotated docs
> -
>
> Key: HDFS-16729
> URL: https://issues.apache.org/jira/browse/HDFS-16729
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-08-16-14-19-07-630.png
>
>
> I found some unreasonable annotations in the documentation here. E.g:
>  !image-2022-08-16-14-19-07-630.png! 
> We should make these annotations cleaner.






[jira] [Created] (HDFS-16729) RBF: fix some unreasonably annotated docs

2022-08-16 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16729:
---

 Summary: RBF: fix some unreasonably annotated docs
 Key: HDFS-16729
 URL: https://issues.apache.org/jira/browse/HDFS-16729
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: rbf
Affects Versions: 3.3.3
Reporter: JiangHua Zhu
 Attachments: image-2022-08-16-14-19-07-630.png

I found some unreasonable annotations in the documentation here. E.g:
 !image-2022-08-16-14-19-07-630.png! 

We should make these annotations cleaner.






[jira] [Work started] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log

2022-07-29 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16700 started by JiangHua Zhu.
---
> Record the real client ip carried by the Router in the NameNode log
> ---
>
> Key: HDFS-16700
> URL: https://issues.apache.org/jira/browse/HDFS-16700
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, rbf
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Here are some logs recorded by the NameNode when using RBF:
> {code:java}
> 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001
> {code}
> The IP recorded here is still the Router's. If the real client IP were 
> recorded, it would show more clearly where the request comes from.
> E.g:
> {code:java}
> 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001, client=172.111.65.123:43232
> {code}






[jira] [Updated] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log

2022-07-29 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16700:

Affects Version/s: (was: 3.3.0)

> Record the real client ip carried by the Router in the NameNode log
> ---
>
> Key: HDFS-16700
> URL: https://issues.apache.org/jira/browse/HDFS-16700
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, rbf
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> Here are some logs recorded by the NameNode when using RBF:
> {code:java}
> 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001
> {code}
> The IP recorded here is still the Router's. If the real client IP were 
> recorded, it would show more clearly where the request comes from.
> E.g:
> {code:java}
> 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001, client=172.111.65.123:43232
> {code}






[jira] [Updated] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log

2022-07-29 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16700:

Affects Version/s: 3.3.0

> Record the real client ip carried by the Router in the NameNode log
> ---
>
> Key: HDFS-16700
> URL: https://issues.apache.org/jira/browse/HDFS-16700
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, rbf
>Affects Versions: 3.3.0, 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> Here are some logs recorded by the NameNode when using RBF:
> {code:java}
> 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001
> {code}
> The IP recorded here is still the Router's. If the real client IP were 
> recorded, it would show more clearly where the request comes from.
> E.g:
> {code:java}
> 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001, client=172.111.65.123:43232
> {code}






[jira] [Created] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log

2022-07-29 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16700:
---

 Summary: Record the real client ip carried by the Router in the 
NameNode log
 Key: HDFS-16700
 URL: https://issues.apache.org/jira/browse/HDFS-16700
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode, rbf
Affects Versions: 3.3.3
Reporter: JiangHua Zhu


Here are some logs recorded by the NameNode when using RBF:
{code:java}
2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
8020, call Call#127 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
172.10.100.67:58001
{code}

The IP recorded here is still the Router's. If the real client IP were 
recorded, it would show more clearly where the request comes from.
E.g:
{code:java}
2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
8020, call Call#127 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
172.10.100.67:58001, client=172.111.65.123:43232
{code}
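One way this can work (a sketch; whether RBF populates the field exactly like 
this is an assumption): the Router forwards the real client address in the RPC 
CallerContext, and the NameNode appends it to the log line:
{code:java}
// On the NameNode side, read the caller context propagated by the Router.
CallerContext ctx = CallerContext.getCurrent();
String realClient = (ctx == null) ? "" : ctx.getContext();
LOG.info("... from {} client={}", remoteAddress, realClient);
{code}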







[jira] [Assigned] (HDFS-16700) Record the real client ip carried by the Router in the NameNode log

2022-07-29 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16700:
---

Assignee: JiangHua Zhu

> Record the real client ip carried by the Router in the NameNode log
> ---
>
> Key: HDFS-16700
> URL: https://issues.apache.org/jira/browse/HDFS-16700
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, rbf
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> Here are some logs recorded by the NameNode when using RBF:
> {code:java}
> 2022-07-28 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001
> {code}
> The IP recorded here is still the Router's. If the real client IP were 
> recorded, it would show more clearly where the request comes from.
> E.g:
> {code:java}
> 2022-07-29 19:31:07,126 INFO ipc.Server: IPC Server handler 8 on default port 
> 8020, call Call#127 Retry#0 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 
> 172.10.100.67:58001, client=172.111.65.123:43232
> {code}






[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-07-18 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17568284#comment-17568284
 ] 

JiangHua Zhu commented on HDFS-16565:
-

Thanks to [~weichiu] for the suggestion.
I think I will use it.

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The connections in CLOSE_WAIT state have reached 17k and are still growing. 
> Viewing these CLOSE_WAITs with the lsof command shows the following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> where the DataNode interacts with other nodes as a client.
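For background, CLOSE_WAIT means the remote peer has closed its end but the 
local side has never called close(). A generic illustration of the fix 
direction (not the actual DataNode code):
{code:java}
// try-with-resources guarantees close() runs even on exceptions, so the
// connection cannot linger in CLOSE_WAIT after the peer sends its FIN.
try (java.net.Socket socket = new java.net.Socket(host, port)) {
  // ... exchange data ...
}
{code}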






[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16650:

Fix Version/s: (was: 2.9.2)
Affects Version/s: 2.9.2

> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> {code:java}
> long now = new Date().getTime();
> {code}
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> {code:java}
> long now = Time.now();
> {code}






[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16650:

Description: 
Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();


{code:java}
long now = new Date().getTime();
{code}


This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();

  was:
Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();

This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();


> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> long now = new Date().getTime();
> {code:java}
> long now = new Date().getTime();
> {code}
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> long now = Time.now();






[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16650:

Description: 
Getting the timestamp in Centralized cache management is done in the following 
way:
{code:java}
long now = new Date().getTime();
{code}

This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
{code:java}
long now = Time.now();
{code}


  was:
Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();


{code:java}
long now = new Date().getTime();
{code}


This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();


> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> {code:java}
> long now = new Date().getTime();
> {code}
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> {code:java}
> long now = Time.now();
> {code}






[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16650:

Description: 
Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();

This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();

  was:
Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();
This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();


> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> long now = new Date().getTime();
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> long now = Time.now();






[jira] [Updated] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16650:

Priority: Minor  (was: Major)

> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> long now = new Date().getTime();
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> long now = Time.now();






[jira] [Work started] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16650 started by JiangHua Zhu.
---
> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.9.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> long now = new Date().getTime();
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> long now = Time.now();






[jira] [Created] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16650:
---

 Summary: Optimize the cost of obtaining timestamps in Centralized 
cache management
 Key: HDFS-16650
 URL: https://issues.apache.org/jira/browse/HDFS-16650
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: caching
Reporter: JiangHua Zhu
 Fix For: 2.9.2


Getting the timestamp in Centralized cache management is done in the following 
way:
long now = new Date().getTime();
This approach doesn't seem optimal, since it allocates a Date object that is 
used only once here.
It might be better to use the Time utility to get the timestamp. E.g:
long now = Time.now();
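For clarity, the replacement in context (org.apache.hadoop.util.Time is the 
standard Hadoop utility; Time.now() wraps System.currentTimeMillis()):
{code:java}
import java.util.Date;
import org.apache.hadoop.util.Time;

// Before: allocates a Date object just to read the clock.
long before = new Date().getTime();
// After: a plain static call, no allocation.
long now = Time.now();
{code}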






[jira] [Assigned] (HDFS-16650) Optimize the cost of obtaining timestamps in Centralized cache management

2022-07-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16650:
---

Assignee: JiangHua Zhu

> Optimize the cost of obtaining timestamps in Centralized cache management
> -
>
> Key: HDFS-16650
> URL: https://issues.apache.org/jira/browse/HDFS-16650
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: caching
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Fix For: 2.9.2
>
>
> Getting the timestamp in Centralized cache management is done in the 
> following way:
> long now = new Date().getTime();
> This approach doesn't seem optimal, since it allocates a Date object that is 
> used only once here.
> It might be better to use the Time utility to get the timestamp. E.g:
> long now = Time.now();






[jira] [Work started] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY

2022-07-01 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16647 started by JiangHua Zhu.
---
> Delete unused NameNode#FS_HDFS_IMPL_KEY
> ---
>
> Key: HDFS-16647
> URL: https://issues.apache.org/jira/browse/HDFS-16647
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is some history here: NameNode#FS_HDFS_IMPL_KEY was introduced in 
> HDFS-15450, and some related code was later removed in HDFS-15533, but 
> FS_HDFS_IMPL_KEY itself was kept.
> Here are some discussion details:
> https://github.com/apache/hadoop/pull/2229#discussion_r470935801
> It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY

2022-07-01 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16647:
---

 Summary: Delete unused NameNode#FS_HDFS_IMPL_KEY
 Key: HDFS-16647
 URL: https://issues.apache.org/jira/browse/HDFS-16647
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.3.3
Reporter: JiangHua Zhu


There is some history here: NameNode#FS_HDFS_IMPL_KEY was introduced in 
HDFS-15450, and some related code was later removed in HDFS-15533, but 
FS_HDFS_IMPL_KEY itself was kept.
Here are some discussion details:
https://github.com/apache/hadoop/pull/2229#discussion_r470935801

It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16647) Delete unused NameNode#FS_HDFS_IMPL_KEY

2022-07-01 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16647:
---

Assignee: JiangHua Zhu

> Delete unused NameNode#FS_HDFS_IMPL_KEY
> ---
>
> Key: HDFS-16647
> URL: https://issues.apache.org/jira/browse/HDFS-16647
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.3
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>
> There is some history here: NameNode#FS_HDFS_IMPL_KEY was introduced in 
> HDFS-15450, and some related code was later removed in HDFS-15533, but 
> FS_HDFS_IMPL_KEY itself was kept.
> Here are some discussion details:
> https://github.com/apache/hadoop/pull/2229#discussion_r470935801
> It seems to be cleaner to remove the unused NameNode#FS_HDFS_IMPL_KEY.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15533) Provide DFS API compatible class(ViewDistributedFileSystem), but use ViewFileSystemOverloadScheme inside

2022-06-30 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561289#comment-17561289
 ] 

JiangHua Zhu commented on HDFS-15533:
-

Nice to talk to you, [~umamaheswararao].
It seems that the redundant NameNode#FS_HDFS_IMPL_KEY should be removed here.
If necessary, I will create a new jira to fix it.
Hope to continue to communicate with you.

> Provide DFS API compatible class(ViewDistributedFileSystem), but use 
> ViewFileSystemOverloadScheme inside
> 
>
> Key: HDFS-15533
> URL: https://issues.apache.org/jira/browse/HDFS-15533
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: dfs, viewfs
>Affects Versions: 3.4.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> I have been working on an idea since last week: we want to provide 
> DFS-compatible APIs with mount functionality, so that existing DFS 
> applications can work without class cast issues.
> When we tested with other components like Hive and HBase, I noticed some 
> class cast issues.
> {code:java}
> HBase example:
> java.lang.ClassCastException: 
> org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme cannot be cast to 
> org.apache.hadoop.hdfs.DistributedFileSystemjava.lang.ClassCastException: 
> org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme cannot be cast to 
> org.apache.hadoop.hdfs.DistributedFileSystem at 
> org.apache.hadoop.hbase.util.FSUtils.getDFSHedgedReadMetrics(FSUtils.java:1748)
>  at 
> org.apache.hadoop.hbase.regionserver.MetricsRegionServerWrapperImpl.(MetricsRegionServerWrapperImpl.java:146)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1594)
>  at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1001)
>  at java.lang.Thread.run(Thread.java:748){code}
> {code:java}
> Hive:
> |io.AcidUtils|: Failed to get files with ID; using regular API: Only 
> supported for DFS; got class 
> org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme{code}
> So, the implementation details are as follows:
> We extended DistributedFileSystem and created a class called 
> "ViewDistributedFileSystem".
> This ViewDistributedFileSystem (vfs) tries to initialize 
> ViewFileSystemOverloadScheme. If that succeeds, calls will delegate to vfs. 
> If it fails to initialize due to no mount points or other errors, it will 
> just fall back to regular DFS init. If users do not configure any mounts, 
> the system will behave exactly like today's DFS. If there are mount points, 
> viewfs functionality will come under DFS.
> I have a patch and will post it in some time.
>  
>  
>  
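As an editorial aside, a minimal sketch of the try-then-fallback pattern 
described above; this is not the actual ViewDistributedFileSystem source, and 
everything except the Hadoop types is illustrative:

{code:java}
// Sketch only: try to initialize ViewFileSystemOverloadScheme first; if that
// fails (e.g. no mount points configured), fall back to plain DFS init.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.viewfs.ViewFileSystemOverloadScheme;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class ViewDfsSketch extends DistributedFileSystem {
  // Non-null once a mount table was found; operations would then delegate
  // to this instead of the plain DFS code paths.
  private ViewFileSystemOverloadScheme vfs;

  @Override
  public void initialize(URI uri, Configuration conf) throws IOException {
    try {
      ViewFileSystemOverloadScheme candidate = new ViewFileSystemOverloadScheme();
      candidate.initialize(uri, conf); // throws if no mounts are configured
      this.vfs = candidate;
    } catch (IOException noMounts) {
      super.initialize(uri, conf);     // behave exactly like today's DFS
    }
  }
}
{code}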



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-20 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556601#comment-17556601
 ] 

JiangHua Zhu commented on HDFS-16637:
-

Thank you for your trust, [~vjasani].
I will be very careful in the future.

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --2022-06-19 15:41:16,184 [Listener at 
> localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}
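An editorial note on the mismatch shown above: the output change from 
HDFS-16581 appended a per-node state (e.g. "In Service") to each 
-printTopology line, so the old expected pattern no longer matches. A hedged 
sketch of how the pattern could be extended; the actual fix in the test 
resources may differ:

{code:java}
// Illustrative only: tolerate an optional trailing state after each node.
import java.util.regex.Pattern;

public class TopologyRegexExample {
  public static void main(String[] args) {
    Pattern p = Pattern.compile(
        "^Rack: \\/rack1(\\s*127\\.0\\.0\\.1:\\d+\\s\\([-.a-zA-Z0-9]+\\)"
            + "( In Service)?)+");
    String actual = "Rack: /rack1\n"
        + "   127.0.0.1:51487 (localhost) In Service\n"
        + "   127.0.0.1:51491 (localhost) In Service";
    System.out.println(p.matcher(actual).find()); // true
  }
}
{code}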



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16637) TestHDFSCLI#testAll consistently failing

2022-06-19 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556183#comment-17556183
 ] 

JiangHua Zhu commented on HDFS-16637:
-

Thanks to [~vjasani] for catching this issue.
I think it was due to my carelessness.
I'm very sorry.

> TestHDFSCLI#testAll consistently failing
> 
>
> Key: HDFS-16637
> URL: https://issues.apache.org/jira/browse/HDFS-16637
> Project: Hadoop HDFS
>  Issue Type: Test
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The failure seems to have been caused by output change introduced by 
> HDFS-16581.
> {code:java}
> 2022-06-19 15:41:16,183 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(146)) - Detailed results:
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(147)) - 
> --2022-06-19 15:41:16,184 [Listener at 
> localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(156)) - 
> ---
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(157)) -                     Test ID: [629]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(158)) -            Test Description: 
> [printTopology: verifying that the topology map is what we expect]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(159)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(163)) -               Test Commands: [-fs 
> hdfs://localhost:51486 -printTopology]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(167)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(174)) - 
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(178)) -                  Comparator: 
> [RegexpAcrossOutputComparator]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(180)) -          Comparision result:   
> [fail]
> 2022-06-19 15:41:16,184 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(182)) -             Expected output:   
> [^Rack: 
> \/rack1\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)\s*127\.0\.0\.1:\d+\s\([-.a-zA-Z0-9]+\)]
> 2022-06-19 15:41:16,185 [Listener at localhost/51519] INFO  cli.CLITestHelper 
> (CLITestHelper.java:displayResults(184)) -               Actual output:   
> [Rack: /rack1
>    127.0.0.1:51487 (localhost) In Service
>    127.0.0.1:51491 (localhost) In ServiceRack: /rack2
>    127.0.0.1:51500 (localhost) In Service
>    127.0.0.1:51496 (localhost) In Service
>    127.0.0.1:51504 (localhost) In ServiceRack: /rack3
>    127.0.0.1:51508 (localhost) In ServiceRack: /rack4
>    127.0.0.1:51512 (localhost) In Service
>    127.0.0.1:51516 (localhost) In Service]
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-11448) JN log segment syncing should support HA upgrade

2022-06-10 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545848#comment-17545848
 ] 

JiangHua Zhu edited comment on HDFS-11448 at 6/10/22 7:15 AM:
--

Hi [~hanishakoneru], nice to communicate with you.
In JNStorage, getCurrentDir() is not used anywhere.
If you don't mind, I'll remove the unused JNStorage#getCurrentDir().


was (Author: jianghuazhu):
Hi [~hanishakoneru], nice to communicate with you.
I found the newly added JNStorage#getCurrentDir() here, and yes, that's good 
because sd.getCurrentDir() is used in multiple places in the surrounding code, 
but the new method itself is not used anywhere.
If you don't mind, I'll modify this to replace sd.getCurrentDir() with 
JNStorage#getCurrentDir().


> JN log segment syncing should support HA upgrade
> 
>
> Key: HDFS-11448
> URL: https://issues.apache.org/jira/browse/HDFS-11448
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
> Fix For: 3.0.0-alpha4
>
> Attachments: HDFS-11448.001.patch, HDFS-11448.002.patch, 
> HDFS-11448.003.patch
>
>
> HDFS-4025 adds support for synchronizing past log segments to JNs that missed 
> them. But, as pointed out by [~jingzhao], if the segment download happens 
> when an admin tries to rollback, it might fail ([see 
> comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16621) Remove unused JNStorage#getCurrentDir()

2022-06-07 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16621:

Description: There is no use of getCurrentDir() anywhere in JNStorage; we 
should remove it.  (was: In JNStorage, sd.getCurrentDir() is used in 5~6 places;
it can be replaced with JNStorage#getCurrentDir(), which will be more concise.)

> Remove unused JNStorage#getCurrentDir()
> ---
>
> Key: HDFS-16621
> URL: https://issues.apache.org/jira/browse/HDFS-16621
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node, qjm
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> There is no use of getCurrentDir() anywhere in JNStorage; we should remove it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16621) Remove unused JNStorage#getCurrentDir()

2022-06-07 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16621:

Summary: Remove unused JNStorage#getCurrentDir()  (was: Replace 
sd.getCurrentDir() with JNStorage#getCurrentDir())

> Remove unused JNStorage#getCurrentDir()
> ---
>
> Key: HDFS-16621
> URL: https://issues.apache.org/jira/browse/HDFS-16621
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node, qjm
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In JNStorage, sd.getCurrentDir() is used in 5~6 places; it can be replaced 
> with JNStorage#getCurrentDir(), which will be more concise.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()

2022-06-05 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16621 started by JiangHua Zhu.
---
> Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
> -
>
> Key: HDFS-16621
> URL: https://issues.apache.org/jira/browse/HDFS-16621
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node, qjm
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In JNStorage, sd.getCurrentDir() is used in 5~6 places; it can be replaced 
> with JNStorage#getCurrentDir(), which will be more concise.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()

2022-06-05 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16621:
---

Assignee: JiangHua Zhu

> Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
> -
>
> Key: HDFS-16621
> URL: https://issues.apache.org/jira/browse/HDFS-16621
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: journal-node, qjm
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>
> In JNStorage, sd.getCurrentDir() is used in 5~6 places; it can be replaced 
> with JNStorage#getCurrentDir(), which will be more concise.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16621) Replace sd.getCurrentDir() with JNStorage#getCurrentDir()

2022-06-05 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16621:
---

 Summary: Replace sd.getCurrentDir() with JNStorage#getCurrentDir()
 Key: HDFS-16621
 URL: https://issues.apache.org/jira/browse/HDFS-16621
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: journal-node, qjm
Affects Versions: 3.3.0
Reporter: JiangHua Zhu


In JNStorage, sd.getCurrentDir() is used in 5~6 places; it can be replaced 
with JNStorage#getCurrentDir(), which will be more concise.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11448) JN log segment syncing should support HA upgrade

2022-06-03 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-11448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545848#comment-17545848
 ] 

JiangHua Zhu commented on HDFS-11448:
-

Hi [~hanishakoneru], nice to communicate with you.
I found the newly added JNStorage#getCurrentDir() here, and yes, that's good 
because sd.getCurrentDir() is used in multiple places in the surrounding code, 
but the new method itself is not used anywhere.
If you don't mind, I'll modify this to replace sd.getCurrentDir() with 
JNStorage#getCurrentDir().


> JN log segment syncing should support HA upgrade
> 
>
> Key: HDFS-11448
> URL: https://issues.apache.org/jira/browse/HDFS-11448
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
> Fix For: 3.0.0-alpha4
>
> Attachments: HDFS-11448.001.patch, HDFS-11448.002.patch, 
> HDFS-11448.003.patch
>
>
> HDFS-4025 adds support for synchronizing past log segments to JNs that missed 
> them. But, as pointed out by [~jingzhao], if the segment download happens 
> when an admin tries to rollback, it might fail ([see 
> comment|https://issues.apache.org/jira/browse/HDFS-4025?focusedCommentId=15850633=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15850633]).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16614) Improve balancer operation strategy and performance

2022-06-01 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16614:

Description: 
When the Balancer program is run, it does some work in the following order:
1. Obtain available datanode information from NameNode.
2. Classify and calculate the average utilization according to StorageType. 
Here, some sets will be obtained in combination with the set thresholds: 
overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
3. According to some calculations, the source and target related to the 
transfer data are obtained. The source is used for the source end, and the 
target is used for the data receiving end.
4. Start the data transfer work in parallel.
This process runs iteratively. Throughout, a single threshold is applied 
uniformly to all StorageTypes, which seems a bit rough, because individual 
StorageTypes cannot be distinguished even though heterogeneous storage is 
supported.

There is an online cluster with more than 2000 nodes, and there is an imbalance 
in node storage. E.g:
 !image-2022-06-02-13-18-33-213.png! 

Here, the average utilization of the cluster is 78%, but the utilization of 
most nodes is between 85% and 90%. When the balancer is turned on, we find that 
85% of the nodes are working as sources. We think this is not reasonable, 
because it occupies more network resources in the cluster; imposing some 
effective restrictions would benefit the normal work of the cluster.
So here are some changes to make:
1. When the balancer is running, we should actively prompt the suggested value 
of the threshold related to StorageType. For example: [[DISK, 10%], [SSD, 
8%]...]
2. Support setting the threshold according to StorageType.
3. Add an option to prohibit nodes below the threshold from joining the Source 
set. This is to allow nodes with high utilization to transfer data as soon as 
possible, which is good for balance.
4. Add new support: if many DataNodes in the cluster have similar usage, they 
should remain unchanged. For example, if the utilization rate of 40% of the 
nodes in the cluster is 75% to 80%, these nodes should not join the Source set. 
Of course, this support needs to be specified by the user at runtime.

  was:
When the Balancer program is run, it does some work in the following order:
1. Obtain available datanode information from NameNode.
2. Classify and calculate the average utilization according to StorageType. 
Here, some sets will be obtained in combination with the set thresholds: 
overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
3. According to some calculations, the source and target related to the 
transfer data are obtained. The source is used for the source end, and the 
target is used for the data receiving end.
4. Start the data transfer work in parallel.
This process runs iteratively. Throughout, a single threshold is applied 
uniformly to all StorageTypes, which seems a bit rough, because individual 
StorageTypes cannot be distinguished even though heterogeneous storage is 
supported.

There is an online cluster with more than 2000 nodes, and there is an imbalance 
in node storage. E.g:
 !image-2022-06-02-13-18-33-213.png! 

Here, the average utilization of the cluster is 78%, but the utilization of 
most nodes is between 85% and 90%. When the balancer is turned on, we find that 
85% of the nodes are working as sources. We think this is not reasonable, 
because it occupies more network resources in the cluster; imposing some 
effective restrictions would benefit the normal work of the cluster.
So here are some changes to make:
1. When the balancer is running, it should try to prompt the threshold related 
to StorageType. For example [[DISK, 10%], [SSD, 8%]...]
2. Support setting the threshold according to StorageType.
3. Add an option to prohibit nodes below the threshold from joining the Source 
set. This is to allow nodes with high utilization to transfer data as soon as 
possible, which is good for balance.
4. Add new support: if many DataNodes in the cluster have similar usage, they 
should remain unchanged. For example, if the utilization rate of 40% of the 
nodes in the cluster is 75% to 80%, these nodes should not join the Source set. 
Of course, this support needs to be specified by the user at runtime.


> Improve balancer operation strategy and performance
> ---
>
> Key: HDFS-16614
> URL: https://issues.apache.org/jira/browse/HDFS-16614
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> 

[jira] [Updated] (HDFS-16614) Improve balancer operation strategy and performance

2022-06-01 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16614:

Affects Version/s: 2.9.2
   (was: 3.3.0)

> Improve balancer operation strategy and performance
> ---
>
> Key: HDFS-16614
> URL: https://issues.apache.org/jira/browse/HDFS-16614
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-06-02-13-18-33-213.png
>
>
> When the Balancer program is run, it does some work in the following order:
> 1. Obtain available datanode information from NameNode.
> 2. Classify and calculate the average utilization according to StorageType. 
> Here, some sets will be obtained in combination with the set thresholds: 
> overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
> 3. According to some calculations, the source and target related to the 
> transfer data are obtained. The source is used for the source end, and the 
> target is used for the data receiving end.
> 4. Start the data transfer work in parallel.
> This process runs iteratively. Throughout, a single threshold is applied 
> uniformly to all StorageTypes, which seems a bit rough, because individual 
> StorageTypes cannot be distinguished even though heterogeneous storage is 
> supported.
> There is an online cluster with more than 2000 nodes, and there is an 
> imbalance in node storage. E.g:
>  !image-2022-06-02-13-18-33-213.png! 
> Here, the average utilization of the cluster is 78%, but the utilization of 
> most nodes is between 85% and 90%. When the balancer is turned on, we find 
> that 85% of the nodes are working as sources. We think this is not 
> reasonable, because it occupies more network resources in the cluster; 
> imposing some effective restrictions would benefit the normal work of the 
> cluster.
> So here are some changes to make:
> 1. When the balancer is running, it should try to prompt the threshold 
> related to StorageType. For example [[DISK, 10%], [SSD, 8%]...]
> 2. Support setting the threshold according to StorageType.
> 3. Add an option to prohibit nodes below the threshold from joining the 
> Source set. This is to allow nodes with high utilization to transfer data as 
> soon as possible, which is good for balance.
> 4. Add new support: if many DataNodes in the cluster have similar usage, they 
> should remain unchanged. For example, if the utilization rate of 40% of the 
> nodes in the cluster is 75% to 80%, these nodes should not join the Source 
> set. Of course, this support needs to be specified by the user at runtime.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16614) Improve balancer operation strategy and performance

2022-06-01 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16614:
---

 Summary: Improve balancer operation strategy and performance
 Key: HDFS-16614
 URL: https://issues.apache.org/jira/browse/HDFS-16614
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, namenode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu
 Attachments: image-2022-06-02-13-18-33-213.png

When the Balancer program is run, it does some work in the following order:
1. Obtain available datanode information from NameNode.
2. Classify and calculate the average utilization according to StorageType. 
Here, some sets will be obtained in combination with the set thresholds: 
overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
3. According to some calculations, the source and target related to the 
transfer data are obtained. The source is used for the source end, and the 
target is used for the data receiving end.
4. Start the data transfer work in parallel.
This process runs iteratively. Throughout, a single threshold is applied 
uniformly to all StorageTypes, which seems a bit rough, because individual 
StorageTypes cannot be distinguished even though heterogeneous storage is 
supported.

There is an online cluster with more than 2000 nodes, and there is an imbalance 
in node storage. E.g:
 !image-2022-06-02-13-18-33-213.png! 

Here, the average utilization of the cluster is 78%, but the utilization of 
most nodes is between 85% and 90%. When the balancer is turned on, we find that 
85% of the nodes are working as sources. We think this is not reasonable, 
because it occupies more network resources in the cluster; imposing some 
effective restrictions would benefit the normal work of the cluster.
So here are some changes to make:
1. When the balancer is running, it should try to prompt the threshold related 
to StorageType. For example [[DISK, 10%], [SSD, 8%]...] (a parsing sketch 
follows after this list).
2. Support setting the threshold according to StorageType.
3. Add an option to prohibit nodes below the threshold from joining the Source 
set. This is to allow nodes with high utilization to transfer data as soon as 
possible, which is good for balance.
4. Add new support: if many DataNodes in the cluster have similar usage, they 
should remain unchanged. For example, if the utilization rate of 40% of the 
nodes in the cluster is 75% to 80%, these nodes should not join the Source set. 
Of course, this support needs to be specified by the user at runtime.
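As an editorial aside on point 1 above, a tiny sketch of parsing a 
per-StorageType threshold map; the option syntax here (e.g. "DISK:10,SSD:8") 
is an assumption for illustration, not an existing Balancer flag:

{code:java}
// Hypothetical parser for per-StorageType thresholds, in percent.
import java.util.EnumMap;
import java.util.Map;
import org.apache.hadoop.fs.StorageType;

public class PerStorageTypeThresholds {
  static Map<StorageType, Double> parse(String spec) {
    Map<StorageType, Double> thresholds = new EnumMap<>(StorageType.class);
    for (String entry : spec.split(",")) {
      String[] kv = entry.split(":");          // e.g. "DISK" and "10"
      thresholds.put(StorageType.valueOf(kv[0].trim()),
          Double.parseDouble(kv[1].trim()));
    }
    return thresholds;
  }

  public static void main(String[] args) {
    // Prints the parsed DISK and SSD thresholds.
    System.out.println(parse("DISK:10,SSD:8"));
  }
}
{code}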



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16614) Improve balancer operation strategy and performance

2022-06-01 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16614:
---

Assignee: JiangHua Zhu

> Improve balancer operation strategy and performance
> ---
>
> Key: HDFS-16614
> URL: https://issues.apache.org/jira/browse/HDFS-16614
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-06-02-13-18-33-213.png
>
>
> When the Balancer program is run, it does some work in the following order:
> 1. Obtain available datanode information from NameNode.
> 2. Classify and calculate the average utilization according to StorageType. 
> Here, some sets will be obtained in combination with the set thresholds: 
> overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
> 3. According to some calculations, the source and target related to the 
> transfer data are obtained. The source is used for the source end, and the 
> target is used for the data receiving end.
> 4. Start the data transfer work in parallel.
> This process runs iteratively. Throughout, a single threshold is applied 
> uniformly to all StorageTypes, which seems a bit rough, because individual 
> StorageTypes cannot be distinguished even though heterogeneous storage is 
> supported.
> There is an online cluster with more than 2000 nodes, and there is an 
> imbalance in node storage. E.g:
>  !image-2022-06-02-13-18-33-213.png! 
> Here, the average utilization of the cluster is 78%, but the utilization of 
> most nodes is between 85% and 90%. When the balancer is turned on, we find 
> that 85% of the nodes are working as sources. We think this is not 
> reasonable, because it occupies more network resources in the cluster; 
> imposing some effective restrictions would benefit the normal work of the 
> cluster.
> So here are some changes to make:
> 1. When the balancer is running, it should try to prompt the threshold 
> related to StorageType. For example [[DISK, 10%], [SSD, 8%]...]
> 2. Support setting the threshold according to StorageType.
> 3. Add an option to prohibit nodes below the threshold from joining the 
> Source set. This is to allow nodes with high utilization to transfer data as 
> soon as possible, which is good for balance.
> 4. Add new support: if many DataNodes in the cluster have similar usage, they 
> should remain unchanged. For example, if the utilization rate of 40% of the 
> nodes in the cluster is 75% to 80%, these nodes should not join the Source 
> set. Of course, this support needs to be specified by the user at runtime.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542370#comment-17542370
 ] 

JiangHua Zhu commented on HDFS-16594:
-

Thanks [~sodonnell] and [~weichiu] for your comments and follow-up.

> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work, and periodically checks the number of blocks remaining to be processed. 
> By default, when checking more than 
> 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, the number 
> of replicas managed by the DataNode reaches 90w or even 100w, during this 
> period, the DatanodeAdminDefaultMonitor will continue to hold the 
> FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection process, there were more than 100w 
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> ...:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu resolved HDFS-16592.
-
Resolution: Not A Problem

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16594:

Description: 
When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
...:34241
'
Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
be longer than usual. A very large number of RpcCalls were affected during this 
time.

  was:
When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.145.92:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.99.152:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.146.56:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.82.106:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 

[jira] [Commented] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17542190#comment-17542190
 ] 

JiangHua Zhu commented on HDFS-16594:
-

In my opinion, processing RpcCalls on time is a relatively high priority, so 
the time that DatanodeAdminDefaultMonitor holds FSNamesystemLock cannot be too 
long.
Here are 2 ways to optimize:
1. The default value of ${dfs.namenode.decommission.blocks.per.interval} can be 
lowered, such as 1 or 2.
2. When DatanodeAdminDefaultMonitor is working, introduce time-slice 
processing. For example, once DatanodeAdminDefaultMonitor has worked for more 
than 500ms, force it to sleep for 10ms and then resume (a sketch of this idea 
follows below).

We can choose one of these 2 methods.
[~weichiu]  [~ayushtkn], do you guys have some good suggestions?
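A minimal sketch of option 2, assuming a hypothetical lock wrapper; the names 
here are illustrative, not the real DatanodeAdminDefaultMonitor internals:

{code:java}
// Sketch: release the namesystem lock after each ~500ms slice of scanning so
// queued RpcCalls get a chance to run, then sleep briefly and resume.
import org.apache.hadoop.util.Time;

public class TimeSlicedScan {
  private static final long MAX_SLICE_MS = 500; // work per lock hold
  private static final long SLEEP_MS = 10;      // breather for RpcCalls

  // Hypothetical stand-in for the FSNamesystem write lock.
  interface NamesystemLock { void writeLock(); void writeUnlock(); }

  static void scanBlocks(Iterable<Object> blocks, NamesystemLock lock)
      throws InterruptedException {
    lock.writeLock();
    boolean held = true;
    try {
      long sliceStart = Time.monotonicNow();
      for (Object block : blocks) {
        process(block);
        if (Time.monotonicNow() - sliceStart > MAX_SLICE_MS) {
          lock.writeUnlock();
          held = false;
          Thread.sleep(SLEEP_MS);   // let pending RpcCalls proceed
          lock.writeLock();
          held = true;
          sliceStart = Time.monotonicNow();
        }
      }
    } finally {
      if (held) {
        lock.writeUnlock();
      }
    }
  }

  private static void process(Object block) {
    // Placeholder for the per-block decommission check.
  }
}
{code}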


> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work, and periodically checks the number of blocks remaining to be processed. 
> By default, when checking more than 
> 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, the number 
> of replicas managed by the DataNode reaches 90w or even 100w, during this 
> period, the DatanodeAdminDefaultMonitor will continue to hold the 
> FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection process, there were more than 100w 
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.146.56:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.81.46:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16594:
---

Assignee: JiangHua Zhu

> Many RpcCalls are blocked for a while while Decommission works
> --
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.9.2
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work, and periodically checks the number of blocks remaining to be processed. 
> By default, when checking more than 
> 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, the number 
> of replicas managed by the DataNode reaches 90w or even 100w, during this 
> period, the DatanodeAdminDefaultMonitor will continue to hold the 
> FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection process, there were more than 100w 
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.146.56:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.196.81.46:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

2022-05-25 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16594:
---

 Summary: Many RpcCalls are blocked for a while while Decommission 
works
 Key: HDFS-16594
 URL: https://issues.apache.org/jira/browse/HDFS-16594
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.9.2
Reporter: JiangHua Zhu
 Attachments: image-2022-05-26-02-05-38-878.png

When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.145.92:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.99.152:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.146.56:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.82.106:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.149.175:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.81.46:34241
'
Since RpcCalls wait for a long time, RpcQueueTime + RpcProcessingTime becomes 
longer than usual. A very large number of RpcCalls were affected during this 
time.
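
For reference, a minimal sketch of the mitigation idea (hypothetical method 
names; the real DatanodeAdminDefaultMonitor logic differs in detail): process 
the pending blocks in bounded batches and release the namesystem write lock 
between batches, so queued RpcCalls such as sendHeartbeat can run.

// Sketch only. FSNamesystem#writeLock()/writeUnlock() are existing APIs;
// check() and the batch handling are illustrative.
void scanWithLockYield(Iterator<BlockInfo> blocks, FSNamesystem namesystem,
    int blocksPerCheck) throws InterruptedException {
  while (blocks.hasNext()) {
    namesystem.writeLock();
    try {
      int processed = 0;
      while (blocks.hasNext() && processed < blocksPerCheck) {
        check(blocks.next()); // hypothetical per-block decommission check
        processed++;
      }
    } finally {
      namesystem.writeUnlock();
    }
    Thread.sleep(1000); // yield so blocked RpcCalls can be processed
  }
}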



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-24 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16592:

Component/s: documentation

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, documentation, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> The word 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-23 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16592 started by JiangHua Zhu.
---
> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-05-24-11-29-14-019.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
>  !image-2022-05-24-11-29-14-019.png! 
> The word 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-23 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16592:
---

Assignee: JiangHua Zhu

> Fix typo for BalancingPolicy
> 
>
> Key: HDFS-16592
> URL: https://issues.apache.org/jira/browse/HDFS-16592
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Minor
> Attachments: image-2022-05-24-11-29-14-019.png
>
>
>  !image-2022-05-24-11-29-14-019.png! 
> The word 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16592) Fix typo for BalancingPolicy

2022-05-23 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16592:
---

 Summary: Fix typo for BalancingPolicy
 Key: HDFS-16592
 URL: https://issues.apache.org/jira/browse/HDFS-16592
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, namenode
Affects Versions: 3.4.0
Reporter: JiangHua Zhu
 Attachments: image-2022-05-24-11-29-14-019.png

 !image-2022-05-24-11-29-14-019.png! 

The word 'NOT' should be lowercase rather than uppercase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16581) Print node status when executing printTopology

2022-05-23 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16581:

Summary: Print node status when executing printTopology  (was: Print 
DataNode node status)

> Print node status when executing printTopology
> --
>
> Key: HDFS-16581
> URL: https://issues.apache.org/jira/browse/HDFS-16581
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsadmin, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We can use the dfsadmin tool to see which DataNodes the cluster has; some of 
> these nodes are alive, DECOMMISSIONED, or DECOMMISSION_INPROGRESS. It would 
> be helpful to get this information in a timely manner, for example when 
> troubleshooting cluster failures or tracking node status.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches

2022-05-23 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16588 started by JiangHua Zhu.
---
> Backport HDFS-16584 to branch-3.3 and other active old branches
> ---
>
> Key: HDFS-16588
> URL: https://issues.apache.org/jira/browse/HDFS-16588
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue has been dealt with in trunk and now needs to be backported to 
> branch-3.3 and the other active branches.
> See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches

2022-05-22 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16588:
---

Assignee: JiangHua Zhu

> Backport HDFS-16584 to branch-3.3 and other active old branches
> ---
>
> Key: HDFS-16588
> URL: https://issues.apache.org/jira/browse/HDFS-16588
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> This issue has been dealt with in trunk and now needs to be backported to 
> branch-3.3 and the other active branches.
> See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16588) Backport HDFS-16584 to branch-3.3 and other active old branches

2022-05-22 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16588:
---

 Summary: Backport HDFS-16584 to branch-3.3 and other active old 
branches
 Key: HDFS-16588
 URL: https://issues.apache.org/jira/browse/HDFS-16588
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, namenode
Reporter: JiangHua Zhu


This issue has been dealt with in trunk and now needs to be backported to 
branch-3.3 and the other active branches.
See HDFS-16584.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16584) Record StandbyNameNode information when Balancer is running

2022-05-19 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16584 started by JiangHua Zhu.
---
> Record StandbyNameNode information when Balancer is running
> ---
>
> Key: HDFS-16584
> URL: https://issues.apache.org/jira/browse/HDFS-16584
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-05-19-20-23-23-825.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the Balancer is running, we allow block data to be fetched from the 
> StandbyNameNode, which is nice. Here are some logs:
>  !image-2022-05-19-20-23-23-825.png! 
> But we have no way of knowing which NameNode the request was made to. We 
> should log more detailed information, such as the host associated with the 
> StandbyNameNode.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16584) Record StandbyNameNode information when Balancer is running

2022-05-19 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16584:
---

Assignee: JiangHua Zhu

> Record StandbyNameNode information when Balancer is running
> ---
>
> Key: HDFS-16584
> URL: https://issues.apache.org/jira/browse/HDFS-16584
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer  mover, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: image-2022-05-19-20-23-23-825.png
>
>
> When the Balancer is running, we allow block data to be fetched from the 
> StandbyNameNode, which is nice. Here are some logs:
>  !image-2022-05-19-20-23-23-825.png! 
> But we have no way of knowing which NameNode the request was made to. We 
> should log more detailed information, such as the host associated with the 
> StandbyNameNode.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16584) Record StandbyNameNode information when Balancer is running

2022-05-19 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16584:
---

 Summary: Record StandbyNameNode information when Balancer is 
running
 Key: HDFS-16584
 URL: https://issues.apache.org/jira/browse/HDFS-16584
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, namenode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu
 Attachments: image-2022-05-19-20-23-23-825.png

When the Balancer is running, we allow block data to be fetched from the 
StandbyNameNode, which is nice. Here are some logs:
 !image-2022-05-19-20-23-23-825.png! 

But we have no way of knowing which NameNode the request was made to. We should 
log more detailed information, such as the host associated with the 
StandbyNameNode.
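
Something as simple as including the resolved address in the existing log line 
would cover this (a sketch with hypothetical variable names; nnAddr and 
isStandby stand for the proxy's resolved address and HA state):

// Sketch: record which NameNode actually served the getBlocks() call.
LOG.info("Fetched block list from NameNode {} (standby={})", nnAddr, isStandby);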



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16581) Print DataNode node status

2022-05-17 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16581 started by JiangHua Zhu.
---
> Print DataNode node status
> --
>
> Key: HDFS-16581
> URL: https://issues.apache.org/jira/browse/HDFS-16581
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsadmin, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> We can use the dfsadmin tool to see which DataNodes the cluster has; some of 
> these nodes are alive, DECOMMISSIONED, or DECOMMISSION_INPROGRESS. It would 
> be helpful to get this information in a timely manner, for example when 
> troubleshooting cluster failures or tracking node status.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-17 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu resolved HDFS-16565.
-
Resolution: Duplicate

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16581) Print DataNode node status

2022-05-17 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16581:
---

Assignee: JiangHua Zhu

> Print DataNode node status
> --
>
> Key: HDFS-16581
> URL: https://issues.apache.org/jira/browse/HDFS-16581
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsadmin, namenode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> We can use the dfsadmin tool to see which DataNodes the cluster has; some of 
> these nodes are alive, DECOMMISSIONED, or DECOMMISSION_INPROGRESS. It would 
> be helpful to get this information in a timely manner, for example when 
> troubleshooting cluster failures or tracking node status.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16581) Print DataNode node status

2022-05-17 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16581:
---

 Summary: Print DataNode node status
 Key: HDFS-16581
 URL: https://issues.apache.org/jira/browse/HDFS-16581
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsadmin, namenode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu


We can use the dfsadmin tool to see which DataNodes the cluster has; some of 
these nodes are alive, DECOMMISSIONED, or DECOMMISSION_INPROGRESS. It would be 
helpful to get this information in a timely manner, for example when 
troubleshooting cluster failures or tracking node status.
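
A rough sketch of how the admin state could be added to the listing 
(illustrative output shape only; DistributedFileSystem#getDataNodeStats() and 
DatanodeInfo#getAdminState() are existing APIs):

// Sketch: print each DataNode with its admin state
// (NORMAL, DECOMMISSION_INPROGRESS, DECOMMISSIONED, ...).
for (DatanodeInfo dn : dfs.getDataNodeStats()) {
  System.out.println("  " + dn.getNetworkLocation() + "/" + dn.getName()
      + " (" + dn.getHostName() + ") : " + dn.getAdminState());
}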



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16576) Remove unused Imports in Hadoop HDFS project

2022-05-11 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535817#comment-17535817
 ] 

JiangHua Zhu commented on HDFS-16576:
-

It looks like the description here is somewhat brief.

> Remove unused Imports in Hadoop HDFS project
> 
>
> Key: HDFS-16576
> URL: https://issues.apache.org/jira/browse/HDFS-16576
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Minor
>
> h3. Optimize Imports to keep code clean
>  # Remove any unused imports



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-08 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Component/s: ec

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-08 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17533457#comment-17533457
 ] 

JiangHua Zhu commented on HDFS-16565:
-

A problem with socket leaks caused by StripedBlockChecksumReconstructor was 
found here.
Here are some logs from the online cluster:
2022-05-07 13:01:46,798 WARN 
org.apache.hadoop.hdfs.server.datanode.BlockChecksumHelper: Exception while 
reading checksum
java.net.SocketTimeoutException: 3000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/10.198.108.108:17834 remote=/10.198.109.181:1004]

Here is the source code for version 3.3.x, in BlockChecksumHelper:
 !screenshot-2.png! 

This issue has been reported in HDFS-15709.
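
For reference, the general fix pattern (a sketch, not the exact HDFS-15709 
patch; createBlockReader() and readChecksum() are hypothetical names): whoever 
opens the block reader must close it on every path, including the timeout 
path, otherwise the underlying socket stays in CLOSE_WAIT on the client side.

BlockReader reader = null;
try {
  reader = createBlockReader(block); // hypothetical factory
  readChecksum(reader);              // may throw SocketTimeoutException
} finally {
  if (reader != null) {
    try {
      reader.close(); // releases the underlying peer/socket
    } catch (IOException e) {
      // best-effort close; nothing more to do here
    }
  }
}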

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-08 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Attachment: screenshot-2.png

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532808#comment-17532808
 ] 

JiangHua Zhu edited comment on HDFS-16565 at 5/6/22 11:46 AM:
--

Thanks [~hexiaoqiao] for the comment.
This phenomenon occurs on all DataNodes in our online cluster. We use Hadoop 
3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is 
the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP 
hadoop-ec482.xxx:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP 
hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP 
hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP 
hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)

Here, hadoop-ec482.xxx is the local DataNode. You can see that a random port 
is used when connecting to other nodes, but the connection then remains for a 
long time and is not released. I guess the problem is on nodes like 
hadoop-ec482.xxx, due to not closing the stream or socket properly.
Our cluster is used in 3 ways:
1. The HDFS Client API is used to store EC data.
2. Data is copied or transferred when a DataNode is forced offline, or when 
the Balancer runs.
3. A small amount of replicated (non-EC) data is stored.
I'm still investigating the exact cause of what's happening here.
[~hexiaoqiao], do you have any better suggestions?
Thank you very much.


was (Author: jianghuazhu):
Thanks [~hexiaoqiao] for the comment.
This phenomenon occurs on all DataNodes in our online cluster. We use Hadoop 
3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is 
the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP 
hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP 
hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP 
hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP 
hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)

Here, hadoop-ec482.xxx is the local DataNode. You can see that a random port 
is used when connecting to other nodes, but the connection then remains for a 
long time and is not released. I guess the problem is on nodes like 
hadoop-ec482.xxx, due to not closing the stream or socket properly.
Our cluster is used in 3 ways:
1. The HDFS Client API is used to store EC data.
2. Data is copied or transferred when a DataNode is forced offline, or when 
the Balancer runs.
3. A small amount of replicated (non-EC) data is stored.
I'm still investigating the exact cause of what's happening here.
[~hexiaoqiao], do you have any better suggestions?
Thank you very much.

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532808#comment-17532808
 ] 

JiangHua Zhu commented on HDFS-16565:
-

Thanks [~hexiaoqiao] for the comment.
This phenomenon occurs on all DataNodes in our online cluster. We use Hadoop 
3.3.x, and these clusters are mainly used to store EC data (RS 6x3). Here is 
the phenomenon on one of the DataNodes:
jsvc 198492 hdfs *100u IPv4 2393306999 0t0 TCP 
hadoop-ec482.xxx.org:45344->hadoop-ec505.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *117u IPv4 2541480174 0t0 TCP 
hadoop-ec482.xxx:53954->hadoop-ec495.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *123u IPv4 2542535148 0t0 TCP 
hadoop-ec482.xxx:39860->hadoop-ec564.xxx:1004 (CLOSE_WAIT)
jsvc 198492 hdfs *125u IPv4 2543324650 0t0 TCP 
hadoop-ec482.xxx:42518->hadoop-ec490.xxx:1004 (CLOSE_WAIT)

Here, hadoop-ec482.xxx is the local DataNode. You can see that a random port 
is used when connecting to other nodes, but the connection then remains for a 
long time and is not released. I guess the problem is on nodes like 
hadoop-ec482.xxx, due to not closing the stream or socket properly.
Our cluster is used in 3 ways:
1. The HDFS Client API is used to store EC data.
2. Data is copied or transferred when a DataNode is forced offline, or when 
the Balancer runs.
3. A small amount of replicated (non-EC) data is stored.
I'm still investigating the exact cause of what's happening here.
[~hexiaoqiao], do you have any better suggestions?
Thank you very much.

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Issue Type: Bug  (was: Improvement)

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Description: 
There is a strange phenomenon here: the DataNode holds a large number of 
connections in the CLOSE_WAIT state and does not release them.
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 20
CLOSE_WAIT 17707
ESTABLISHED 1450
TIME_WAIT 12

The number of connections in the CLOSE_WAIT state has reached 17k and is still 
growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
following:
lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
 !screenshot-1.png! 

It can be seen that the cause is that Socket#close() is not called correctly 
when the DataNode interacts with other nodes as a client.

  was:
There is a strange phenomenon here: the DataNode holds a large number of 
connections in the CLOSE_WAIT state and does not release them.
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 20
CLOSE_WAIT 17707
ESTABLISHED 1450
TIME_WAIT 12
The number of connections in the CLOSE_WAIT state has reached 17k and is still 
growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
following:
lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
 !screenshot-1.png! 

It can be seen that the cause is that Socket#close() is not called correctly 
when the DataNode interacts with other nodes as a client.


> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Environment: CentOS Linux release 7.5.1804 (Core)

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
> Environment: CentOS Linux release 7.5.1804 (Core)
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16565:
---

Assignee: (was: JiangHua Zhu)

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Attachment: screenshot-1.png

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> When DataTransfer runs, the local node needs to connect to another DataNode 
> over a socket. If the connection fails, a NoRouteToHostException is thrown.
> Exception information:
> 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(...:1004, 
> datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
> ipcPort=8025, 
> storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
>  to transfer BP-1375239094-...- 
> 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
> java.net.NoRouteToHostException: No route to host
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> The code where the failure originates:
> sock = newSocket();
> NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
> sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
> sock.setSoTimeout(targets.length * dnConf.socketTimeout);
> When a NoRouteToHostException occurs, the Block is added to the 
> VolumeScanner, and the VolumeScanner starts scanning the Block. This should 
> not happen, because a connection failure is not a genuine block-level 
> IOException.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Description: 
There is a strange phenomenon here: the DataNode holds a large number of 
connections in the CLOSE_WAIT state and does not release them.
netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
LISTEN 20
CLOSE_WAIT 17707
ESTABLISHED 1450
TIME_WAIT 12
The number of connections in the CLOSE_WAIT state has reached 17k and is still 
growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
following:
lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
 !screenshot-1.png! 

It can be seen that the cause is that Socket#close() is not called correctly 
when the DataNode interacts with other nodes as a client.

  was:
When DataTransfer runs, the local node needs to connect to another DataNode 
over a socket. If the connection fails, a NoRouteToHostException is thrown.
Exception information:
2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(...:1004, 
datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
ipcPort=8025, 
storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
 to transfer BP-1375239094-...- 
1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The code where the failure originates:
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
sock.setSoTimeout(targets.length * dnConf.socketTimeout);

When a NoRouteToHostException occurs, the Block is added to the VolumeScanner, 
and the VolumeScanner starts scanning the Block. This should not happen, 
because a connection failure is not a genuine block-level IOException.



> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> There is a strange phenomenon here: the DataNode holds a large number of 
> connections in the CLOSE_WAIT state and does not release them.
> netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
> LISTEN 20
> CLOSE_WAIT 17707
> ESTABLISHED 1450
> TIME_WAIT 12
> The number of connections in the CLOSE_WAIT state has reached 17k and is 
> still growing. Inspecting these CLOSE_WAITs with the lsof command shows the 
> following:
> lsof -i tcp | grep -E 'CLOSE_WAIT|COMMAND'
>  !screenshot-1.png! 
> It can be seen that the cause is that Socket#close() is not called correctly 
> when the DataNode interacts with other nodes as a client.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) DataNode holds a large number of CLOSE_WAIT connections that are not released

2022-05-06 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Summary: DataNode holds a large number of CLOSE_WAIT connections that are 
not released  (was: Optimize DataNode#DataTransfer, when encountering 
NoRouteToHostException)

> DataNode holds a large number of CLOSE_WAIT connections that are not released
> -
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> When DataTransfer runs, the local node needs to connect to another DataNode 
> over a socket. If the connection fails, a NoRouteToHostException is thrown.
> Exception information:
> 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(...:1004, 
> datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
> ipcPort=8025, 
> storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
>  to transfer BP-1375239094-...- 
> 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
> java.net.NoRouteToHostException: No route to host
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> The code where the failure originates:
> sock = newSocket();
> NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
> sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
> sock.setSoTimeout(targets.length * dnConf.socketTimeout);
> When a NoRouteToHostException occurs, the Block is added to the 
> VolumeScanner, and the VolumeScanner starts scanning the Block. This should 
> not happen, because a connection failure is not a genuine block-level 
> IOException.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException

2022-04-30 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16565:

Description: 
When DataTransfer runs, the local node needs to connect to another DataNode 
over a socket. If the connection fails, a NoRouteToHostException is thrown.
Exception information:
2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(...:1004, 
datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
ipcPort=8025, 
storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
 to transfer BP-1375239094-...- 
1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The code where the failure originates:
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
sock.setSoTimeout(targets.length * dnConf.socketTimeout);

When a NoRouteToHostException occurs, the Block is added to the VolumeScanner, 
and the VolumeScanner starts scanning the Block. This should not happen, 
because a connection failure is not a genuine block-level IOException.


  was:
When DataTransfer runs, the local node needs to connect to another DataNode 
over a socket. If the connection fails, a NoRouteToHostException is thrown.
Exception information:
2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(...:1004, 
datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
ipcPort=8025, 
storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
 to transfer BP-1375239094-...- 
1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The code where the failure originates:
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
sock.setSoTimeout(targets.length * dnConf.socketTimeout);

When a NoRouteToHostException occurs, the Block is added to the VolumeScanner, 
and the VolumeScanner starts scanning the Block. This should not happen, 
because a connection failure is not a genuine block-level IOException.
catch (IOException ie) {
handleBadBlock(b, ie, false);
LOG.warn("{}:Failed to transfer {} to {} got",
bpReg, b, targets[0], ie);
  }



> Optimize DataNode#DataTransfer, when encountering NoRouteToHostException
> 
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> When DataTransfer runs, the local node needs to connect to another DataNode 
> over a socket. If the connection fails, a NoRouteToHostException is thrown.
> Exception information:
> 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(...:1004, 
> datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
> ipcPort=8025, 
> storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
>  to transfer BP-1375239094-...- 
> 

[jira] [Assigned] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException

2022-04-29 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16565:
---

Assignee: JiangHua Zhu

> Optimize DataNode#DataTransfer, when encountering NoRouteToHostException
> 
>
> Key: HDFS-16565
> URL: https://issues.apache.org/jira/browse/HDFS-16565
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.3.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> When DataTransfer runs, the local node needs to connect to another DataNode 
> over a socket. If the connection fails, a NoRouteToHostException is thrown.
> Exception information:
> 2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeRegistration(...:1004, 
> datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
> ipcPort=8025, 
> storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
>  to transfer BP-1375239094-...- 
> 1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
> java.net.NoRouteToHostException: No route to host
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> at 
> org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
> at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> The code where the failure originates:
> sock = newSocket();
> NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
> sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
> sock.setSoTimeout(targets.length * dnConf.socketTimeout);
> When a NoRouteToHostException occurs, the Block is added to the 
> VolumeScanner, and the VolumeScanner starts scanning the Block. This should 
> not happen, because a connection failure is not a genuine block-level 
> IOException.
> catch (IOException ie) {
> handleBadBlock(b, ie, false);
> LOG.warn("{}:Failed to transfer {} to {} got",
> bpReg, b, targets[0], ie);
>   }



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16565) Optimize DataNode#DataTransfer, when encountering NoRouteToHostException

2022-04-29 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16565:
---

 Summary: Optimize DataNode#DataTransfer, when encountering 
NoRouteToHostException
 Key: HDFS-16565
 URL: https://issues.apache.org/jira/browse/HDFS-16565
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu


When DataTransfer runs, the local node needs to connect to another DataNode 
over a socket. If the connection fails, a NoRouteToHostException is thrown.
Exception information:
2022-04-29 15:47:47,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(...:1004, 
datanodeUuid=..., infoPort=1006 , infoSecurePort=0, 
ipcPort=8025, 
storageInfo=lv=-57;cid=...;nsid=961284063;c=1589290804417):Failed
 to transfer BP-1375239094-...- 
1589290804417:blk_-9223372035798255743_66037710 to ..xxx.:1004 got
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:533)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:497)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2562)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The code where the failure originates:
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout); 
sock.setTcpNoDelay(dnConf.getDataTransferServerTcpNoDelay());
sock.setSoTimeout(targets.length * dnConf.socketTimeout);

When a NoRouteToHostException occurs, the Block is added to the VolumeScanner, 
and the VolumeScanner starts scanning the Block. This should not happen, 
because a connection failure is not a genuine block-level IOException.
catch (IOException ie) {
handleBadBlock(b, ie, false);
LOG.warn("{}:Failed to transfer {} to {} got",
bpReg, b, targets[0], ie);
  }
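
One possible shape for the special case (a sketch; handleBadBlock is the 
existing method shown above, the instanceof branch is illustrative): treat a 
connection-setup failure as a node problem rather than a block problem, so the 
replica is not handed to the VolumeScanner.

catch (IOException ie) {
  if (ie instanceof java.net.NoRouteToHostException
      || ie instanceof java.net.ConnectException) {
    // The connection was never established: the target is unreachable and
    // the local replica is not suspect, so skip handleBadBlock().
    LOG.warn("{}:Failed to connect to {} for transfer of {}",
        bpReg, targets[0], b, ie);
  } else {
    handleBadBlock(b, ie, false);
    LOG.warn("{}:Failed to transfer {} to {} got",
        bpReg, b, targets[0], ie);
  }
}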




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread JiangHua Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503966#comment-17503966
 ] 

JiangHua Zhu commented on HDFS-16498:
-

This seems to be a robustness issue with the NameNode.
A normal BlockReport workflow:
1. The DataNode registers itself with the NameNode.
2. The DataNode sends a block report request to the NameNode.
3. The NameNode processes the BlockReport.

It looks like this happens when the NameNode and DataNode restart at the same 
time.
When this happens, would it be more appropriate to log this at WARN or INFO 
level? This is just my thought.
 !screenshot-1.png! 
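
For reference, a minimal sketch of the kind of null check that would avoid 
the NPE, assuming it arises because DatanodeManager#getDatanode() returns 
null for a DataNode that has not re-registered; the method shape mirrors 
BlockManager#checkBlockReportLease(), but the exact handling below is an 
assumption, not the committed patch:

// Sketch only. Field and method names mirror the BlockManager code
// paths referenced above; the null handling is an assumed fix.
public boolean checkBlockReportLease(BlockReportContext context,
    DatanodeID nodeID) throws UnregisteredNodeException {
  if (context == null) {
    return true;
  }
  DatanodeDescriptor node = datanodeManager.getDatanode(nodeID);
  if (node == null) {
    // The DataNode sent a full block report before re-registering;
    // reject the report instead of dereferencing a null descriptor.
    throw new UnregisteredNodeException(nodeID, null);
  }
  return blockReportLeaseManager.checkLease(node,
      Time.monotonicNow(), context.getLeaseId());
}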




> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> During a NameNode restart, a DataNode that has not yet registered may 
> trigger an FBR, which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16498) Fix NPE for checkBlockReportLease

2022-03-09 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu updated HDFS-16498:

Attachment: screenshot-1.png

> Fix NPE for checkBlockReportLease
> -
>
> Key: HDFS-16498
> URL: https://issues.apache.org/jira/browse/HDFS-16498
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-03-09-20-35-22-028.png, screenshot-1.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> During a NameNode restart, a DataNode that has not yet registered may 
> trigger an FBR, which causes an NPE.
> !image-2022-03-09-20-35-22-028.png|width=871,height=158!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()

2022-03-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16494 started by JiangHua Zhu.
---
> Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
> ---
>
> Key: HDFS-16494
> URL: https://issues.apache.org/jira/browse/HDFS-16494
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.9.2, 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When building the AvailableSpaceVolumeChoosingPolicy, if the default 
> constructor is used, initLocks() is called twice, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()

2022-03-04 Thread JiangHua Zhu (Jira)
JiangHua Zhu created HDFS-16494:
---

 Summary: Removed reuse of 
AvailableSpaceVolumeChoosingPolicy#initLocks()
 Key: HDFS-16494
 URL: https://issues.apache.org/jira/browse/HDFS-16494
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode
Affects Versions: 2.9.2, 3.4.0
Reporter: JiangHua Zhu


When building the AvailableSpaceVolumeChoosingPolicy, if the default 
constructor is used, initLocks() is called twice, which is unnecessary.
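
A paraphrased sketch of the redundancy (constructor bodies are illustrative, 
not copied verbatim from the Hadoop source):

public AvailableSpaceVolumeChoosingPolicy() {
  this(new Random());   // the delegated constructor already runs initLocks()
  initLocks();          // redundant second call -- can be dropped
}

AvailableSpaceVolumeChoosingPolicy(Random random) {
  this.random = random;
  initLocks();          // first (sufficient) initialization
}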



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-16494) Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()

2022-03-04 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JiangHua Zhu reassigned HDFS-16494:
---

Assignee: JiangHua Zhu

> Removed reuse of AvailableSpaceVolumeChoosingPolicy#initLocks()
> ---
>
> Key: HDFS-16494
> URL: https://issues.apache.org/jira/browse/HDFS-16494
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.9.2, 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>
> When building the AvailableSpaceVolumeChoosingPolicy, if the default 
> constructor is used, initLocks() is called twice, which is unnecessary.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-16476) Increase the number of metrics used to record PendingRecoveryBlocks

2022-02-22 Thread JiangHua Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-16476 started by JiangHua Zhu.
---
> Increase the number of metrics used to record PendingRecoveryBlocks
> ---
>
> Key: HDFS-16476
> URL: https://issues.apache.org/jira/browse/HDFS-16476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: metrics, namenode
>Affects Versions: 2.9.2, 3.4.0
>Reporter: JiangHua Zhu
>Assignee: JiangHua Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The complete block recovery process is as follows:
> 1. The NameNode collects the blocks that need to be recovered.
> 2. The NameNode issues recovery instructions to some DataNodes.
> 3. Each DataNode reports back to the NameNode once execution is complete.
> Currently there is no way to know how many blocks are being recovered. A 
> metric recording PendingRecoveryBlocks should be added, which is good for 
> the robustness of the cluster (see the sketch after the quoted logs below).
> Here are some logs of DataNode execution:
> 2022-02-10 23:51:04,386 [12208592621] - INFO  [IPC Server handler 38 on 
> 8025:FsDatasetImpl@2687] - initReplicaRecovery: changing replica state for 
> blk_ from RBW to RUR
> 2022-02-10 23:51:04,395 [12208592630] - INFO  [IPC Server handler 47 on 
> 8025:FsDatasetImpl@2708] - updateReplica: BP-:blk_, 
> recoveryId=18386356475, length=129869866, replica=ReplicaUnderRecovery, 
> blk_, RUR
> Here are some logs the NameNode receives after completion:
> 2022-02-22 10:43:58,780 [8193058814] - INFO  [IPC Server handler 15 on 
> 8021:FSNamesystem@3647] - commitBlockSynchronization(oldBlock=BP-, 
> newgenerationstamp=18551926574, newlength=16929, newtargets=[1:1004, 
> 2:1004, 3:1004]) successful
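
A minimal sketch of how such a gauge could be exposed through the metrics2 
framework; the metric name, description, and the pendingRecoveryBlocks 
structure queried here are assumptions for illustration, not the committed 
change:

import org.apache.hadoop.metrics2.annotation.Metric;

// Sketch: an FSNamesystem-style gauge, mirroring existing metrics such
// as PendingReplicationBlocks. The backing pendingRecoveryBlocks
// structure and its size() accessor are assumed for illustration.
@Metric({"PendingRecoveryBlocks",
    "Number of blocks currently under lease recovery"})
public long getPendingRecoveryBlocks() {
  return pendingRecoveryBlocks.size();
}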



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


