[jira] [Updated] (HDFS-17298) Fix NPE in DataNode.handleBadBlock and BlockSender

2023-12-25 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated HDFS-17298:

Fix Version/s: 3.3.9

> Fix NPE in DataNode.handleBadBlock and BlockSender
> --
>
> Key: HDFS-17298
> URL: https://issues.apache.org/jira/browse/HDFS-17298
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> We are seeing NullPointerExceptions (NPEs) on the DataNode side of our
> production environment. The detailed exception information is:
> {code:java}
> 2023-12-20 13:58:25,449 ERROR datanode.DataNode (DataXceiver.java:run(330)) 
> [DataXceiver for client DFSClient_NONMAPREDUCE_xxx at /xxx:41452 [Sending 
> block BP-xxx:blk_xxx]] - xxx:50010:DataXceiver error processing READ_BLOCK 
> operation  src: /xxx:41452 dst: /xxx:50010
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:301)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> NPE code logic (the snippet below is from DataNode.handleBadBlock):
> {code:java}
> if (!fromScanner && blockScanner.isEnabled()) {
>   // data.getVolume(block) is null
>   blockScanner.markSuspectBlock(data.getVolume(block).getStorageID(),
>   block);
> } 
> {code}
> {code:java}
> 2023-12-20 13:52:18,844 ERROR datanode.DataNode (DataXceiver.java:run(330)) 
> [DataXceiver for client /xxx:61052 [Copying block BP-xxx:blk_xxx]] - 
> xxx:50010:DataXceiver error processing COPY_BLOCK operation  src: /xxx:61052 
> dst: /xxx:50010
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.handleBadBlock(DataNode.java:4045)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1163)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> NPE code logic (the snippet below is from the BlockSender constructor):
> {code:java}
> // Obtain a reference before reading data
> volumeRef = datanode.data.getVolume(block).obtainReference(); 
> //datanode.data.getVolume(block) is null  
> {code}
> We need to fix it.
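
Both failures come from dereferencing the result of getVolume(block) when it is
null. The following is only a minimal sketch of a null guard, not the change
merged for this issue. The Volume class, the String block IDs, and the method
names requireVolume and storageIdForSuspectBlock are hypothetical stand-ins for
the real DataNode types.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class VolumeLookupSketch {
  /** Hypothetical stand-in for a DataNode volume; not the real HDFS type. */
  static final class Volume {
    private final String storageId;

    Volume(String storageId) {
      this.storageId = storageId;
    }

    String getStorageID() {
      return storageId;
    }
  }

  private final Map<String, Volume> volumesByBlock = new HashMap<>();

  void addBlock(String blockId, Volume volume) {
    volumesByBlock.put(blockId, volume);
  }

  /** May return null, for example when the block is no longer known. */
  Volume getVolume(String blockId) {
    return volumesByBlock.get(blockId);
  }

  /** Read path: fail with a descriptive IOException instead of an NPE. */
  Volume requireVolume(String blockId) throws IOException {
    Volume volume = getVolume(blockId);
    if (volume == null) {
      throw new IOException("Volume not found for block " + blockId);
    }
    return volume;
  }

  /** Bad-block path: return null so the caller can skip marking the block. */
  String storageIdForSuspectBlock(String blockId) {
    Volume volume = getVolume(blockId);
    return volume == null ? null : volume.getStorageID();
  }
}
```

With a guard like this, the read path reports a normal I/O error to the client,
and the bad-block path can skip the suspect-block marking when the volume is
missing.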



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17303) Make WINDOW_SIZE and NUM_WINDOWS configurable.

2023-12-25 Thread huangzhaobo99 (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangzhaobo99 updated HDFS-17303:
-
Description: 
1. The delay that the DN reports to the NN is averaged over 3 hours, which 
confuses me. The window comes from the MutableRollingAverages defaults: 
WINDOW_SIZE_MS_DEFAULT = 300_000 and NUM_WINDOWS_DEFAULT = 36.
2. The NN collects SlowNodes using a latency threshold that defaults to 5 ms 
(dfs.datanode.slowpeer.low.threshold.ms), while the threshold for printing 
slow-node logs when writing downstream is 300 ms 
(dfs.datanode.slow.io.warning.threshold.ms).
3. Can we make the window size and the number of windows configurable 
parameters?

  was:
1. The delay that the DN reports to the NN is averaged over 3 hours, which 
confuses me.
2. The NN collects SlowNodes using a latency threshold that defaults to 5 ms 
(dfs.datanode.slowpeer.low.threshold.ms), while the threshold for printing 
slow-node logs when writing downstream is 300 ms 
(dfs.datanode.slow.io.warning.threshold.ms).
3. Can we make the window size and the number of windows configurable 
parameters?


> Make WINDOW_SIZE and NUM_WINDOWS configurable.
> --
>
> Key: HDFS-17303
> URL: https://issues.apache.org/jira/browse/HDFS-17303
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huangzhaobo99
>Priority: Major
>
> 1. The delay that the DN reports to the NN is averaged over 3 hours, which 
> confuses me. The window comes from the MutableRollingAverages defaults: 
> WINDOW_SIZE_MS_DEFAULT = 300_000 and NUM_WINDOWS_DEFAULT = 36.
> 2. The NN collects SlowNodes using a latency threshold that defaults to 5 ms 
> (dfs.datanode.slowpeer.low.threshold.ms), while the threshold for printing 
> slow-node logs when writing downstream is 300 ms 
> (dfs.datanode.slow.io.warning.threshold.ms).
> 3. Can we make the window size and the number of windows configurable 
> parameters?
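
The 3-hour figure follows from the two defaults quoted above: 300_000 ms per
window times 36 windows. The sketch below only reproduces that arithmetic. The
class name RollingWindowSpan and the span helper are hypothetical, and the
second call uses made-up values to show what a configurable window would allow.

```java
import java.time.Duration;

final class RollingWindowSpan {
  // Defaults quoted in the issue description.
  static final long WINDOW_SIZE_MS_DEFAULT = 300_000L;
  static final int NUM_WINDOWS_DEFAULT = 36;

  /** Total time covered by numWindows windows of windowSizeMs each. */
  static Duration span(long windowSizeMs, int numWindows) {
    return Duration.ofMillis(Math.multiplyExact(windowSizeMs, (long) numWindows));
  }

  public static void main(String[] args) {
    // 300_000 ms * 36 = 10_800_000 ms
    Duration defaults = span(WINDOW_SIZE_MS_DEFAULT, NUM_WINDOWS_DEFAULT);
    System.out.println("default span: " + defaults.toHours() + " hours");

    // Hypothetical configured values: 60-second windows, 10 of them.
    Duration configured = span(60_000L, 10);
    System.out.println("configured span: " + configured.toMinutes() + " minutes");
  }
}
```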






[jira] [Commented] (HDFS-17303) Make WINDOW_SIZE and NUM_WINDOWS configurable.

2023-12-25 Thread huangzhaobo99 (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800399#comment-17800399
 ] 

huangzhaobo99 commented on HDFS-17303:
--

Hi [~ayushtkn], [~tasanuma], [~tomscut], and everyone: if you have time, please 
take a look at this issue. Any help is appreciated. Thanks.

> Make WINDOW_SIZE and NUM_WINDOWS configurable.
> --
>
> Key: HDFS-17303
> URL: https://issues.apache.org/jira/browse/HDFS-17303
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huangzhaobo99
>Priority: Major
>






[jira] [Updated] (HDFS-17303) Make WINDOW_SIZE and NUM_WINDOWS configurable.

2023-12-25 Thread huangzhaobo99 (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangzhaobo99 updated HDFS-17303:
-
Description: 
1. The delay that the DN reports to the NN is averaged over 3 hours, which 
confuses me.
2. The NN collects SlowNodes using a latency threshold that defaults to 5 ms 
(dfs.datanode.slowpeer.low.threshold.ms), while the threshold for printing 
slow-node logs when writing downstream is 300 ms 
(dfs.datanode.slow.io.warning.threshold.ms).
3. Can we make the window size and the number of windows configurable 
parameters?

> Make WINDOW_SIZE and NUM_WINDOWS configurable.
> --
>
> Key: HDFS-17303
> URL: https://issues.apache.org/jira/browse/HDFS-17303
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huangzhaobo99
>Priority: Major
>






[jira] [Created] (HDFS-17303) Make WINDOW_SIZE and NUM_WINDOWS configurable.

2023-12-25 Thread huangzhaobo99 (Jira)
huangzhaobo99 created HDFS-17303:


 Summary: Make WINDOW_SIZE and NUM_WINDOWS configurable.
 Key: HDFS-17303
 URL: https://issues.apache.org/jira/browse/HDFS-17303
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: huangzhaobo99









[jira] [Commented] (HDFS-17301) Add read and write dataXceiver threads count metrics to datanode.

2023-12-25 Thread huangzhaobo99 (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800394#comment-17800394
 ] 

huangzhaobo99 commented on HDFS-17301:
--

Hi [~ayushtkn] and [~tasanuma], please help review this PR when you are 
available. Thanks.

> Add read and write dataXceiver threads count metrics to datanode.
> -
>
> Key: HDFS-17301
> URL: https://issues.apache.org/jira/browse/HDFS-17301
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huangzhaobo99
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: pull-request-available
>
> # The DataNodeActiveXeiversCount metric counts the threads of all Op types 
> together.
> # In most cases, we care more about the number of read and write dataXceiver 
> threads, so add read and write dataXceiver thread count metrics to the DataNode.
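
One way to picture the request is a pair of counters kept next to the existing
total. This is a rough sketch under stated assumptions, not the code from the
pull request: the class XceiverCountSketch, its Op enum (a reduced, hypothetical
set of operation types), and the start/finish hooks are all illustrative names.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class XceiverCountSketch {
  /** Hypothetical subset of operation types used only for this sketch. */
  enum Op { READ_BLOCK, WRITE_BLOCK, COPY_BLOCK }

  private final AtomicInteger active = new AtomicInteger();
  private final AtomicInteger activeReads = new AtomicInteger();
  private final AtomicInteger activeWrites = new AtomicInteger();

  /** Called when a thread starts handling an operation. */
  void start(Op op) {
    active.incrementAndGet();
    if (op == Op.READ_BLOCK) {
      activeReads.incrementAndGet();
    } else if (op == Op.WRITE_BLOCK) {
      activeWrites.incrementAndGet();
    }
  }

  /** Called when the thread finishes the operation. */
  void finish(Op op) {
    active.decrementAndGet();
    if (op == Op.READ_BLOCK) {
      activeReads.decrementAndGet();
    } else if (op == Op.WRITE_BLOCK) {
      activeWrites.decrementAndGet();
    }
  }

  int getActiveCount() {
    return active.get();
  }

  int getActiveReadCount() {
    return activeReads.get();
  }

  int getActiveWriteCount() {
    return activeWrites.get();
  }
}
```

The total still covers every operation type, while the two extra gauges isolate
the read and write threads that the description says operators care about most.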






[jira] [Updated] (HDFS-17301) Add read and write dataXceiver threads count metrics to datanode.

2023-12-25 Thread huangzhaobo99 (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangzhaobo99 updated HDFS-17301:
-
Description: 
# The DataNodeActiveXeiversCount metric counts the threads of all Op types 
together.
# In most cases, we care more about the number of read and write dataXceiver 
threads, so add read and write dataXceiver thread count metrics to the DataNode.

> Add read and write dataXceiver threads count metrics to datanode.
> -
>
> Key: HDFS-17301
> URL: https://issues.apache.org/jira/browse/HDFS-17301
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huangzhaobo99
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-25 Thread huangzhaobo99 (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800393#comment-17800393
 ] 

huangzhaobo99 commented on HDFS-17056:
--

[~ayushtkn] Thanks for merging.

> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 3.4.0
>
>
> {code:java}
> bin/hdfs ec -verifyClusterSetup XOR-2-1-1024k
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, XOR-2-1-1024k. The number of DataNodes is only 3.
> {code}
> verifyClusterSetup expects -policy followed by the policy names; otherwise it 
> defaults to all enabled policies.
> If additional invalid options are passed, it silently ignores them, unlike 
> other EC commands, which throw a too-many-arguments exception.
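
The behaviour asked for here can be sketched as a small argument check: accept
nothing (meaning all enabled policies) or -policy followed by names, and reject
anything else. This is an illustration only, not the parsing code used by the
hdfs ec command or by the merged fix. The class name, the parse method, and the
error messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VerifyClusterSetupArgsSketch {
  /**
   * Returns the policy names that follow -policy. An empty list means
   * "check all enabled policies". Unexpected arguments are rejected
   * rather than silently ignored.
   */
  static List<String> parse(String... args) {
    if (args.length == 0) {
      return new ArrayList<>();
    }
    if (!"-policy".equals(args[0])) {
      // Hypothetical message; the real command's wording may differ.
      throw new IllegalArgumentException(
          "Too many arguments: " + String.join(" ", args));
    }
    if (args.length == 1) {
      throw new IllegalArgumentException("-policy requires at least one policy name");
    }
    return new ArrayList<>(Arrays.asList(args).subList(1, args.length));
  }
}
```

Under this sketch, the invocation shown in the description (a bare policy name
without -policy) would be rejected instead of falling back to all enabled
policies.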






[jira] [Resolved] (HDFS-17298) Fix NPE in DataNode.handleBadBlock and BlockSender

2023-12-25 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma resolved HDFS-17298.
-
Fix Version/s: 3.4.0
   Resolution: Fixed

> Fix NPE in DataNode.handleBadBlock and BlockSender
> --
>
> Key: HDFS-17298
> URL: https://issues.apache.org/jira/browse/HDFS-17298
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>






[jira] [Commented] (HDFS-17298) Fix NPE in DataNode.handleBadBlock and BlockSender

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800390#comment-17800390
 ] 

ASF GitHub Bot commented on HDFS-17298:
---

tasanuma commented on PR #6374:
URL: https://github.com/apache/hadoop/pull/6374#issuecomment-1869169552

   Merged. Thanks for your PR, @haiyang1987!




> Fix NPE in DataNode.handleBadBlock and BlockSender
> --
>
> Key: HDFS-17298
> URL: https://issues.apache.org/jira/browse/HDFS-17298
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HDFS-17298) Fix NPE in DataNode.handleBadBlock and BlockSender

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800389#comment-17800389
 ] 

ASF GitHub Bot commented on HDFS-17298:
---

tasanuma merged PR #6374:
URL: https://github.com/apache/hadoop/pull/6374




> Fix NPE in DataNode.handleBadBlock and BlockSender
> --
>
> Key: HDFS-17298
> URL: https://issues.apache.org/jira/browse/HDFS-17298
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Commented] (HDFS-17215) RBF: Fix some method annotations about @throws

2023-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800356#comment-17800356
 ] 

Ayush Saxena commented on HDFS-17215:
-

Committed to trunk.
Thanks [~bigdata_zoodev] for the contribution!

> RBF: Fix some method annotations about @throws 
> ---
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
>
> The Javadoc of the setQuota method in the Quota class has an error in its 
> @throws section.






[jira] [Resolved] (HDFS-17215) RBF: Fix some method annotations about @throws

2023-12-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-17215.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> RBF: Fix some method annotations about @throws 
> ---
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>






[jira] [Commented] (HDFS-17215) RBF: Fix some method annotations about @throws

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800355#comment-17800355
 ] 

ASF GitHub Bot commented on HDFS-17215:
---

ayushtkn merged PR #6136:
URL: https://github.com/apache/hadoop/pull/6136




> RBF: Fix some method annotations about @throws 
> ---
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Updated] (HDFS-17215) RBF: Fix some method annotations about @throws

2023-12-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-17215:

Summary: RBF: Fix some method annotations about @throws   (was: RBF: fix 
some method annotations about @throws )

> RBF: Fix some method annotations about @throws 
> ---
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Assigned] (HDFS-17215) The setQuota method annotation of the Quota class has an error

2023-12-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena reassigned HDFS-17215:
---

Assignee: xiaojunxiang

> The setQuota method annotation of the Quota class has an error
> --
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Updated] (HDFS-17215) RBF: fix some method annotations about @throws

2023-12-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-17215:

Summary: RBF: fix some method annotations about @throws   (was: The 
setQuota method annotation of the Quota class has an error)

> RBF: fix some method annotations about @throws 
> ---
>
> Key: HDFS-17215
> URL: https://issues.apache.org/jira/browse/HDFS-17215
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.3.4
>Reporter: xiaojunxiang
>Assignee: xiaojunxiang
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Resolved] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-25 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena resolved HDFS-17056.
-
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
> Fix For: 3.4.0
>
>






[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-25 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800339#comment-17800339
 ] 

Ayush Saxena commented on HDFS-17056:
-

Committed to trunk.
Thanks [~huangzhaobo99] for the contribution!

> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>






[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800338#comment-17800338
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

ayushtkn merged PR #6379:
URL: https://github.com/apache/hadoop/pull/6379




> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>






[jira] [Commented] (HDFS-17284) Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800328#comment-17800328
 ] 

ASF GitHub Bot commented on HDFS-17284:
---

tasanuma commented on code in PR #6348:
URL: https://github.com/apache/hadoop/pull/6348#discussion_r1436119492


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestDatanodeManager.java:
##
@@ -1126,4 +1128,66 @@ public MockDfsNetworkTopology(){
   super();
 }
   }
+
+  @Test
+  public void testComputeReconstructedTaskNum() throws IOException {
+verifyComputeReconstructedTaskNum(100, 100, 150, 250, 100);
+verifyComputeReconstructedTaskNum(200, 10, 20, 30, 40);
+verifyComputeReconstructedTaskNum(100, 100, 150, 250, 100);
+verifyComputeReconstructedTaskNum(1400, 200, 200, 400, 200);
+
+  }
+  public void verifyComputeReconstructedTaskNum(int xmitsInProgress, int numReplicationBlocks,
+  int maxTransfers, int numECTasksToBeReplicated, int numBlocksToBeErasureCoded)
+  throws IOException {
+FSNamesystem fsn = Mockito.mock(FSNamesystem.class);
+Mockito.when(fsn.hasWriteLock()).thenReturn(true);
+Configuration conf = new Configuration();
+conf.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_MAX_STREAMS_KEY, maxTransfers);
+DatanodeManager dm = Mockito.spy(mockDatanodeManager(fsn, conf));
+
+DatanodeDescriptor nodeInfo = Mockito.mock(DatanodeDescriptor.class);
+Mockito.when(nodeInfo.isRegistered()).thenReturn(true);
+Mockito.when(nodeInfo.getStorageInfos()).thenReturn(new DatanodeStorageInfo[0]);
+
+Mockito.when(nodeInfo.getNumberOfReplicateBlocks()).thenReturn(numReplicationBlocks);
+Mockito.when(nodeInfo.getNumberOfECBlocksToBeReplicated()).thenReturn(numECTasksToBeReplicated);
+Mockito.when(nodeInfo.getNumberOfBlocksToBeErasureCoded())
+.thenReturn(numBlocksToBeErasureCoded);
+
+// Create an ArgumentCaptor to capture the counts for numReplicationTasks,
+// numEcReplicatedTasks,numECReconstructedTasks.
+ArgumentCaptor<Integer> captor = ArgumentCaptor.forClass(Integer.class);
+Mockito.when(nodeInfo.getErasureCodeCommand(ArgumentMatchers.anyInt()))
+.thenReturn(Collections.nCopies(0, null));
+Mockito.when(nodeInfo.getReplicationCommand(ArgumentMatchers.anyInt()))
+.thenReturn(Collections.nCopies(0, null));
+Mockito.when(nodeInfo.getECReplicatedCommand(ArgumentMatchers.anyInt()))
+.thenReturn(Collections.nCopies(0, null));
+
+DatanodeRegistration nodeReg = Mockito.mock(DatanodeRegistration.class);
+Mockito.when(dm.getDatanode(nodeReg)).thenReturn(nodeInfo);
+
+
+dm.handleHeartbeat(nodeReg, new StorageReport[1], "bp-123", 0, 0,
+10, xmitsInProgress, 0, null, SlowPeerReports.EMPTY_REPORT,
+SlowDiskReports.EMPTY_REPORT);
+
+Mockito.verify(nodeInfo).getReplicationCommand(captor.capture());
+int numReplicationTasks = captor.getValue();
+
+Mockito.verify(nodeInfo).getECReplicatedCommand(captor.capture());
+int numEcReplicatedTasks = captor.getValue();
+
+Mockito.verify(nodeInfo).getErasureCodeCommand(captor.capture());
+int numECReconstructedTasks = captor.getValue();
+
+// Verify that when DN xmitsInProgress exceeds maxTransfers,
+// the number of tasks should be <= 0.
+if(xmitsInProgress >= maxTransfers){
+  assertTrue(numReplicationTasks <= 0);
+  assertTrue(numEcReplicatedTasks <= 0);
+  assertTrue(numECReconstructedTasks <= 0);
+}

Review Comment:
   To ensure that `verifyComputeReconstructedTaskNum(200, 10, 20, 30, 40)`
   doesn't cause an overflow, it would be necessary to check the reverse
   situation as well.
   ```suggestion
   } else {
 assertTrue(numReplicationTasks >= 0);
 assertTrue(numEcReplicatedTasks >= 0);
 assertTrue(numECReconstructedTasks >= 0);
   }
   ```
   
   Other than that, it looks good to me.
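
For readers following the thread, here is a minimal sketch of one way an int
overflow of this kind can arise and why a sign check on the result catches it.
It is an assumption-laden illustration, not the code from PR #6348: the class
TaskShare, its method names and the sample numbers are hypothetical. The idea
is that a proportional share computed as (pending * maxTransfers) / total wraps
around when the product is formed in int, and widening to long before the
multiplication avoids that.

```java
// Hypothetical illustration; names and numbers are not taken from the patch.
final class TaskShare {

  // Overflow-prone variant: the int product can wrap before the division.
  static int shareWithIntProduct(int pending, int maxTransfers, int totalPending) {
    return (int) Math.ceil((double) (pending * maxTransfers) / totalPending);
  }

  // Safer variant: widen to long first so the product cannot wrap.
  static int shareWithLongProduct(int pending, int maxTransfers, int totalPending) {
    return (int) Math.ceil((double) ((long) pending * maxTransfers) / totalPending);
  }

  public static void main(String[] args) {
    int pending = 60_000;
    int maxTransfers = 50_000;   // 60_000 * 50_000 = 3_000_000_000 > Integer.MAX_VALUE
    int totalPending = 120_000;

    int wrapped = shareWithIntProduct(pending, maxTransfers, totalPending);
    int widened = shareWithLongProduct(pending, maxTransfers, totalPending);

    // The wrapped result is negative even though every input is positive,
    // which is the condition an "assertTrue(x >= 0)" style check would flag.
    System.out.println("int product:  " + wrapped);
    System.out.println("long product: " + widened);
  }
}
```

This is why asserting only the "<= 0" side when xmitsInProgress >= maxTransfers
is not enough: an overflowed value can also be negative in the case where a
non-negative task count is expected.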





> Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks 
> during block recovery
> --
>
> Key: HDFS-17284
> URL: https://issues.apache.org/jira/browse/HDFS-17284
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Hualong Zhang
>Assignee: Hualong Zhang
>Priority: Major
>  Labels: pull-request-available
>
> Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks 
> during block recovery






[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800326#comment-17800326
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

hadoop-yetus commented on PR #6382:
URL: https://github.com/apache/hadoop/pull/6382#issuecomment-1869031045

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  11m 13s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m  5s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  30m 39s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  16m  1s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  14m 33s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 19s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m  5s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 20s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 37s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 40s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m 57s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 31s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m 58s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  15m 22s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  15m 22s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  14m 39s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  14m 39s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   4m  5s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6382/1/artifact/out/results-checkstyle-root.txt) |  root: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267)  |
   | +1 :green_heart: |  mvnsite  |   3m  2s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 14s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 37s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  5s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 54s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  19m 11s |  |  hadoop-common in the patch passed.  |
   | +1 :green_heart: |  unit  | 219m 52s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m  4s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 466m 51s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6382/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6382 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
   | uname | Linux b9dfd8abc97f 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / cbd1c31242176b1e1df1e05a0e72fd142646ebf5 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6382/1/testReport/ |
   

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-25 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800322#comment-17800322
 ] 

Takanobu Asanuma commented on HDFS-17299:
-

I also agree with the implementation of a bestEffort approach on the client 
side when creating a pipeline. Addressing this issue on the NameNode side would 
likely be difficult due to the complexity involved in managing rack status.

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that a datanode is dead.
>  
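
The 20.5 minute figure quoted above can be reproduced with a short
calculation. As far as I understand it, the NameNode treats a DataNode as dead
after 2 * recheck-interval + 10 * heartbeat-interval; the sketch below assumes
that formula, and also assumes the recheck interval is 600000 ms, which is the
value the 20.5 minute figure implies. The class and method names are
hypothetical.

```java
// Hypothetical helper that mirrors the dead-node timeout arithmetic.
final class HeartbeatExpiry {

  /**
   * @param recheckIntervalMs    dfs.namenode.heartbeat.recheck-interval, in milliseconds
   * @param heartbeatIntervalSec dfs.heartbeat.interval, in seconds
   * @return time after the last heartbeat before a DataNode is considered dead, in ms
   */
  static long expireIntervalMs(long recheckIntervalMs, long heartbeatIntervalSec) {
    return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
  }

  public static void main(String[] args) {
    long ms = expireIntervalMs(600_000L, 3L);
    // 2 * 600000 + 10 * 1000 * 3 = 1230000 ms, i.e. 20.5 minutes
    System.out.println(ms + " ms = " + (ms / 60_000.0) + " min");
  }
}
```

During that window the NameNode keeps offering DataNodes from the failed AZ,
which is what the reproduction steps below rely on.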
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find block locations that satisfy the rack 
> placement policy (one copy in each rack, which essentially means one copy in 
> each AZ).
>  # Since all the datanodes in that AZ are down but still considered alive by 
> the namenode, the client gets different datanodes, but all of them are still 
> in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> 

[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800315#comment-17800315
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

hadoop-yetus commented on PR #6383:
URL: https://github.com/apache/hadoop/pull/6383#issuecomment-1868994410

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   7m 42s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +0 :ok: |  xmllint  |   0m  1s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 30s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m 39s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 17s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   7m 29s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   2m  2s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 42s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m  6s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 45s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 20s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   8m  0s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   8m  0s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 29s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   7m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   1m 57s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6383/1/artifact/out/results-checkstyle-root.txt) |  root: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 267)  |
   | +1 :green_heart: |  mvnsite  |   1m 39s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 16s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 42s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 36s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  16m 21s |  |  hadoop-common in the patch passed.  |
   | -1 :x: |  unit  | 199m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6383/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 353m 14s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestDecommissionWithStripedBackoffMonitor |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestDFSStripedOutputStream |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6383/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6383 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
   | uname | Linux 7a30dafd7c3b 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-25 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800309#comment-17800309
 ] 

Xiaoqiao He commented on HDFS-17299:


{quote}Maybe we should consider dropping the datanode from the pipeline, if 
possible, when we can't replace it, and reattempt with the remaining datanodes, 
similar to bestEffort in the normal DatanodeReplacement case after the stream 
has been created.
{quote}
+1, there seems to be no smoother solution for this case. cc [~shahrs87], 
would you like to contribute a fix? We will get involved here if you need any 
help. Thanks.


[jira] [Commented] (HDFS-17284) Fix int overflow in calculating numEcReplicatedTasks and numReplicationTasks during block recovery

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800301#comment-17800301
 ] 

ASF GitHub Bot commented on HDFS-17284:
---

slfan1989 commented on PR #6348:
URL: https://github.com/apache/hadoop/pull/6348#issuecomment-1868938439

   LGTM.










[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800297#comment-17800297
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

hadoop-yetus commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1868937265

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  37m 38s |  |  branch has no errors when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  37m 59s |  |  Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 32s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  37m 29s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m  8s |  |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 159m 35s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6380 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f46c9e9ec436 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 16dd7df424d7b19cfd19b6504661fdebad7b5e01 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/2/testReport/ |
   | Max. process+thread count | 2412 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
h2. Current shortcomings

[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: To use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total and how many nameservices it currently serves. I then 
have to calculate carefully how many handlers to allocate to each ns so that 
the sum across all ns does not exceed the router's total, while also 
considering how many handlers each ns needs for good performance. Even one 
handler too many for a single ns pushes the sum above the number of handlers 
owned by the router and causes the router to fail to start. I then have to 
investigate why the router failed to start and, once I find the reason, 
reconsider the number of handlers for each ns. In addition, whenever I 
reconfigure the total number of handlers on the router, I have to re-allocate 
handlers to each ns, which increases the complexity of operation and 
maintenance.

2. *Adding new ns is not supported*: If a new ns is added to the cluster while 
the router is running and a mount is added for it, the ns cannot be accessed 
through the router because no handler has been allocated to it. We must 
reconfigure the number of handlers and refresh the configuration before the 
router can access the ns normally. Reconfiguring the number of handlers brings 
us back to disadvantage 1: configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of 
RouterRpcFairnessPolicyController is to let the router access ns under normal 
load without being affected by ns under higher load. However, not all ns have 
high loads, and the ns that do are not under high load 24 hours a day. The load 
may be high only during certain periods, such as 0 to 8 o'clock, and normal 
the rest of the time. Assume there are 2 ns and each is allocated half of the 
handlers. Assume that ns1 has many requests from 0 to 14 o'clock and almost 
none from 14 to 24 o'clock, while ns2 has many requests from 12 to 24 o'clock 
and almost none from 0 to 12 o'clock. Between 0 and 12 o'clock and between 14 
and 24 o'clock, only one ns has many requests and the other has almost none, 
so half of the handlers are wasted.

4. *Only isolation, no sharing*: StaticRouterRpcFairnessPolicyController 
supports only isolation, not sharing. I think isolation is just a means of 
improving the router's access to ns under normal load, not the goal itself. It 
is impossible for all ns in the cluster to have high loads. On the contrary, in 
most scenarios only a few ns in the cluster have high loads, and the loads of 
most other ns are normal. We need to isolate the handlers of ns with higher 
load from those of ns with normal load, so that the former do not affect the 
performance of the latter. However, nameservices that are all under normal 
load, or all under higher load, do not need to be isolated from each other; ns 
of the same nature can share the handlers of the router. This performs better 
than assigning a fixed number of handlers to each ns, because each ns can use 
all the handlers of the router.
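
As a rough illustration of shortcomings 1 and 3, the sketch below checks a 
static per-ns allocation against the router's total and counts the hours in 
which half of the handlers sit idle in the two-ns example. It is only a 
sketch: the class StaticHandlerPlan and its methods are hypothetical names, 
not Hadoop code, and the total of 10 handlers is an assumed figure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Hadoop. */
final class StaticHandlerPlan {

  /**
   * Mirrors the constraint described in shortcoming 1: the per-ns handler
   * counts must not add up to more than the router's total handler count.
   */
  static void validate(int totalHandlers, Map<String, Integer> handlersPerNs) {
    int sum = 0;
    for (int count : handlersPerNs.values()) {
      sum += count;
    }
    if (sum > totalHandlers) {
      throw new IllegalArgumentException(
          "Allocated " + sum + " handlers but the router only has "
              + totalHandlers);
    }
  }

  /**
   * Counts the hours of the day (0..23) in which exactly one of two ns is
   * busy, i.e. the hours in which the other half of a 50/50 static split
   * sits idle. Busy windows are half-open: [start, end).
   */
  static int hoursWithOneBusyNs(int ns1Start, int ns1End,
                                int ns2Start, int ns2End) {
    int hours = 0;
    for (int hour = 0; hour < 24; hour++) {
      boolean ns1Busy = hour >= ns1Start && hour < ns1End;
      boolean ns2Busy = hour >= ns2Start && hour < ns2End;
      if (ns1Busy != ns2Busy) {
        hours++;
      }
    }
    return hours;
  }

  public static void main(String[] args) {
    // Assumed figures: 10 handlers in total, split evenly over two ns.
    Map<String, Integer> plan = new LinkedHashMap<>();
    plan.put("ns1", 5);
    plan.put("ns2", 5);
    validate(10, plan); // fits exactly, no exception

    plan.put("ns2", 6); // one handler too many
    try {
      validate(10, plan);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }

    // ns1 busy from 0 to 14 o'clock, ns2 busy from 12 to 24 o'clock.
    System.out.println(hoursWithOneBusyNs(0, 14, 12, 24));
  }
}
```

With ns1 busy from 0 to 14 o'clock and ns2 busy from 12 to 24 o'clock, the 
second method reports 22 hours in which only one ns is busy, so under an even 
static split half of the handlers are idle for 22 of the 24 hours.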


h2. New features
Given the above deficiencies of StaticRouterRpcFairnessPolicyController in 
usage and performance, I provide a new 
RouterRpcFairnessPolicyController: ProportionRouterRpcFairnessPolicyController 
(maybe with a better name) to address these major shortcomings.


1. *More user-friendly configuration*: Handlers can be allocated to each ns 
proportionally. For example, if we give ns1 a handler ratio of 0.2, ns1 will 
use 0.2 of the total number of handlers on the router. With this method, we do 
not need to know in advance how many handlers the router has.

2. *Sharing and isolation*: Sharing is as important as isolation. The sum of 
the handlers allocated to all ns may exceed the total number of handlers. For 
example, assuming we have 10 handlers and 3 ns, we can allocate 5 (0.5) 
handlers to ns1, 5 (0.5) handlers to ns2, and 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-25 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Attachment: HDFS-17302.003.patch

> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch, 
> HDFS-17302.003.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController that supports configuring a separate 
> number of handlers for each ns (nameservice). With it, the router can isolate 
> different ns, so that an ns under higher load does not affect the router's 
> access to ns under normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in several ways:
> 1. *Configuration is inconvenient and error-prone*: To use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total and how many nameservices it currently 
> serves. I then have to calculate carefully how many handlers to allocate to 
> each ns, so that the sum over all ns does not exceed the router's total, 
> while also considering which allocation gives better performance. If I 
> allocate even one handler too many to some ns, the sum exceeds the number of 
> handlers the router has and the router fails to start. I then have to 
> investigate why the router failed to start and, once I find the cause, rework 
> the number of handlers for each ns. In addition, whenever I change the total 
> number of handlers on the router, I have to re-allocate handlers to every ns, 
> which increases the complexity of operation and maintenance.
> 2. *Adding new ns is not supported*: If a new ns is added to the cluster 
> while the router is running and a mount is added for it, the ns cannot be 
> accessed through the router because no handler has been allocated to it. We 
> must reconfigure the number of handlers and refresh the configuration before 
> the router can access the ns normally. Reconfiguring the number of handlers 
> brings us back to disadvantage 1: configuration is inconvenient and 
> error-prone.
> 3. *Wasted handlers*: The main purpose of 
> RouterRpcFairnessPolicyController is to let the router access ns under normal 
> load without being affected by ns under higher load. However, not all ns have 
> high loads, and the ns that do are not under high load 24 hours a day. The 
> load may be high only during certain periods, such as 0 to 8 o'clock, and 
> normal the rest of the time. Assume there are 2 ns and each is allocated half 
> of the handlers. Assume that ns1 has many requests from 0 to 14 o'clock and 
> almost none from 14 to 24 o'clock, while ns2 has many requests from 12 to 24 
> o'clock and almost none from 0 to 12 o'clock. Between 0 and 12 o'clock and 
> between 14 and 24 o'clock, only one ns has many requests and the other has 
> almost none, so half of the handlers are wasted.
> 4. *Only isolation, no sharing*: StaticRouterRpcFairnessPolicyController 
> supports only isolation, not sharing. I think isolation is just a means of 
> improving the router's access to ns under normal load, not the goal itself. 
> It is impossible for all ns in the cluster to have high loads. On the 
> contrary, in most scenarios only a few ns in the cluster have high loads, and 
> the loads of most other ns are normal. We need to isolate the handlers of ns 
> with higher load from those of ns with normal load, so that the former do not 
> affect the performance of the latter. However, nameservices that are all 
> under normal load, or all under higher load, do not need to be isolated from 
> each other; ns of the same nature can share the handlers of the router. This 
> performs better than assigning a fixed number of handlers to each ns, because 
> each ns can use all the handlers of the router.
> h2. New features
> Given the above deficiencies of StaticRouterRpcFairnessPolicyController in 
> usage and performance, I provide a new 
>