[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800266#comment-17800266
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

LiuGuH opened a new pull request, #6383:
URL: https://github.com/apache/hadoop/pull/6383

   …ateid is always delayed with Active Namenode for a period of time
   
   
   
   ### Description of PR
   Now when the Observer NameNode is used, if a client's stateid is delayed, the 
RPC call will be requeued into the call queue. If the EditLogTailer is broken or 
something else goes wrong, the call will be requeued again and again.
   
   So the Observer should throw ObserverRetryOnActiveException if the stateid is 
always delayed relative to the Active NameNode for a configured time.
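   A rough sketch of that behaviour (a hypothetical, simplified tracker; the 
class name, method, and threshold parameter are illustrative and not from the 
actual patch):

```java
// Hypothetical sketch: track how long an RPC call's client stateid has lagged
// behind the Observer's applied stateid, and signal that the client should
// retry on the Active once a configured threshold is exceeded, instead of
// requeueing the call forever.
public class StaleStateIdTracker {
    private final long retryOnActiveMillis; // configured threshold (assumed)
    private long firstDelayedTimeMillis = -1;

    public StaleStateIdTracker(long retryOnActiveMillis) {
        this.retryOnActiveMillis = retryOnActiveMillis;
    }

    /**
     * Returns true if the call should be failed with
     * ObserverRetryOnActiveException instead of being requeued again.
     */
    public boolean shouldRetryOnActive(long clientStateId, long serverStateId,
                                       long nowMillis) {
        if (clientStateId <= serverStateId) {
            firstDelayedTimeMillis = -1; // caught up; reset the timer
            return false;                // serve the read on the Observer
        }
        if (firstDelayedTimeMillis < 0) {
            firstDelayedTimeMillis = nowMillis; // first time we saw the lag
        }
        // Still lagging: requeue unless we have waited past the threshold.
        return nowMillis - firstDelayedTimeMillis >= retryOnActiveMillis;
    }
}
```

   The point of the timer reset is that transient lag (the EditLogTailer 
catching up normally) keeps being requeued, while persistent lag past the 
configured time is pushed back to the Active.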
   




> [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is 
> always delayed with Active Namenode for a configured time
> 
>
> Key: HDFS-17300
> URL: https://issues.apache.org/jira/browse/HDFS-17300
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: liuguanghua
>Priority: Major
>  Labels: pull-request-available
>
> Now when the Observer NameNode is used, if a client's stateid is delayed, the 
> RPC call will be requeued into the call queue. If the EditLogTailer is broken 
> or something else goes wrong, the call will be requeued again and again.
> So the Observer should throw ObserverRetryOnActiveException if the stateid is 
> always delayed relative to the Active NameNode for a configured time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800264#comment-17800264
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

LiuGuH closed pull request #6382: HDFS-17300. [SBN READ] Observer should throw 
ObserverRetryOnActiveException if stateid is always delayed with Active 
Namenode for a configured time 
URL: https://github.com/apache/hadoop/pull/6382







[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800263#comment-17800263
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

LiuGuH opened a new pull request, #6382:
URL: https://github.com/apache/hadoop/pull/6382

   
   
   ### Description of PR
   
   Now when the Observer NameNode is used, if a client's stateid is delayed, the 
RPC call will be requeued into the call queue. If the EditLogTailer is broken or 
something else goes wrong, the call will be requeued again and again.
   So the Observer should throw ObserverRetryOnActiveException if the stateid is 
always delayed relative to the Active NameNode for a configured time.
   
   







[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800262#comment-17800262
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

KeeProMise commented on code in PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#discussion_r1435995958


##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/fairness/ProportionRouterRpcFairnessPolicyController.java:
##
@@ -0,0 +1,76 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.federation.fairness;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hdfs.server.federation.router.FederationUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import java.util.Set;
+
+import static org.apache.hadoop.hdfs.server.federation.fairness.RouterRpcFairnessConstants.CONCURRENT_NS;
+import static org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_FAIR_HANDLER_PROPORTION_DEFAULT;
+import static org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_FAIR_HANDLER_PROPORTION_KEY_PREFIX;
+import static org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_HANDLER_COUNT_DEFAULT;
+import static org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_HANDLER_COUNT_KEY;
+
+/**
+ * Proportion fairness policy extending
+ * {@link AbstractRouterRpcFairnessPolicyController}. It fetches the handler
+ * proportion for every available name service from the configuration and,
+ * based on each proportion and the total number of handlers, calculates the
+ * handler count for each ns. The handler counts do not change for this
+ * controller.
+ */
+public class ProportionRouterRpcFairnessPolicyController extends
+    AbstractRouterRpcFairnessPolicyController {
+
+  private static final Logger LOG =
+      LoggerFactory.getLogger(ProportionRouterRpcFairnessPolicyController.class);
+
+  public ProportionRouterRpcFairnessPolicyController(Configuration conf) {
+    init(conf);
+  }
+
+  @Override
+  public void init(Configuration conf) {
+    super.init(conf);
+    // Total handlers configured to process all incoming RPCs.
+    int handlerCount = conf.getInt(DFS_ROUTER_HANDLER_COUNT_KEY,
+        DFS_ROUTER_HANDLER_COUNT_DEFAULT);
+
+    LOG.info("Handlers available for fairness assignment {} ", handlerCount);
+
+    // Get all name services configured.
+    Set<String> allConfiguredNS = FederationUtil.getAllConfiguredNS(conf);
+
+    // Insert the concurrent nameservice into the set to process together.
+    allConfiguredNS.add(CONCURRENT_NS);
+    for (String nsId : allConfiguredNS) {

Review Comment:
   You can take a look at https://issues.apache.org/jira/browse/HDFS-17302 for 
a detailed description.





> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to 

[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800261#comment-17800261
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

LiuGuH closed pull request #6376: HDFS-17300. [SBN READ] Observer should throw 
ObserverRetryOnActiveException if stateid is always delayed with Active 
Namenode for a configured time 
URL: https://github.com/apache/hadoop/pull/6376







[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800260#comment-17800260
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435991712


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) throws IOException {
         throw e;
       }
     } else {
+      if (args.size() > 0) {
+        System.err.println(getName() + ": Too many arguments");

Review Comment:
   @ayushtkn, that won't change for now. Thanks for your guidance!





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy followed by the policy names; otherwise it 
> defaults to all enabled policies.
> In case there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a "Too many arguments" error.
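The missing check amounts to rejecting leftover positional arguments. A minimal 
sketch (a hypothetical helper, not the actual ECAdmin code; the message format 
mirrors the one in the PR):

```java
import java.util.List;

// Hypothetical sketch of the argument check the patch adds: after consuming
// the known options, any leftover argument should be reported instead of
// silently ignored.
public class ArgCheck {
    /** Returns an error message if unexpected arguments remain, else null. */
    public static String checkNoExtraArgs(String cmdName, List<String> args) {
        if (!args.isEmpty()) {
            return cmdName + ": Too many arguments";
        }
        return null;
    }
}
```

In the example above, `hdfs ec -verifyClusterSetup XOR-2-1-1024k` would then 
fail fast instead of quietly verifying all enabled policies.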






[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
h2. Current shortcomings

[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total and how many nameservices it currently serves, and then 
carefully calculate how many handlers to allocate to each ns so that the sum 
across all ns does not exceed the router's total, while also considering how 
many handlers each ns needs for good performance. I therefore need to be very 
careful when configuring: if even one ns is given one handler too many, so that 
the total exceeds the number of handlers the router owns, the router will fail 
to start. I then have to investigate why the router failed to start and, after 
finding the reason, reconsider the handler count for each ns. In addition, 
whenever I change the router's total handler count, I have to re-allocate 
handlers to every ns, which increases the complexity of operation and 
maintenance.

2. *Extending ns is not supported*: While the router is running, if a new ns is 
added to the cluster and a mount is added for it, the ns cannot be accessed 
through the router, because no handlers have been allocated to it. We must 
reconfigure the handler counts and then refresh the configuration; only then 
can the router access the ns normally. And when we reconfigure the handler 
counts, we again face shortcoming 1: configuration is inconvenient and 
error-prone.

3. *Wasted handlers*: The main purpose of proposing 
RouterRpcFairnessPolicyController is to let the router access ns under normal 
load without being affected by ns under higher load. First, not all ns have 
high loads; second, ns with high loads are not highly loaded 24 hours a day. It 
may be that only certain periods, such as 00:00 to 08:00, are busy, while other 
periods carry normal load. Assume there are 2 ns and each is allocated half of 
the handlers. If ns1 has many requests from 00:00 to 14:00 and almost none from 
14:00 to 24:00, while ns2 has many requests from 12:00 to 24:00 and almost none 
from 00:00 to 14:00, then between 00:00 and 12:00 and between 14:00 and 24:00 
only one ns has many requests while the other has almost none, so half of the 
handlers are wasted.

4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
does not support sharing, only isolation. Isolation is just a means to improve 
the router's access to ns under normal load, not the goal. It is impossible for 
all ns in a cluster to have high loads; on the contrary, in most scenarios only 
a few ns have high loads while the loads of most other ns are normal. For ns 
with higher load and ns with normal load, we need to isolate their handlers so 
that the higher-load ns does not affect the performance of the lower-load ns. 
However, nameservices that are all under normal load, or all under higher load, 
do not need to be isolated from each other; ns of the same nature can share the 
router's handlers. This performs better than assigning a fixed number of 
handlers to each ns, because each ns can use all the handlers of the router.


h2. New features
As described above, StaticRouterRpcFairnessPolicyController has deficiencies in 
both usability and performance. I provide a new RouterRpcFairnessPolicyController, 
ProportionRouterRpcFairnessPolicyController (maybe with a better name), to 
address these major shortcomings.


1. *More user-friendly configuration*: supports allocating handlers to each ns 
proportionally. For example, we can give ns1 a handler ratio of 0.2; ns1 will 
then use 0.2 of the router's total handlers. With this method, we do not need 
to know in advance how many handlers the router has.
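The proportional assignment described above can be sketched roughly as follows 
(the class name, the rounding, and the minimum-of-one-permit rule are all 
illustrative assumptions, not the patch's actual behavior):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of proportional handler assignment: each nameservice
// gets round(totalHandlers * proportion) permits. Nothing requires the
// proportions to sum to 1.0, so nameservices may share handlers.
public class ProportionSketch {
    public static Map<String, Integer> assign(int totalHandlers,
                                              Map<String, Double> proportions) {
        Map<String, Integer> permits = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : proportions.entrySet()) {
            // At least one permit so a configured ns is never unreachable.
            int n = Math.max(1, (int) Math.round(totalHandlers * e.getValue()));
            permits.put(e.getKey(), n);
        }
        return permits;
    }
}
```

With 10 total handlers and three ns each configured at 0.5, every ns would get 
5 permits (a sum of 15 > 10), which is how the "sharing" behavior in the 
description can fall out of a proportional scheme.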

2. *Sharing and isolation*: Sharing is as important as isolation. We allow the 
sum of handlers across all ns to exceed the router's total. For example, 
assuming we have 10 handlers and 3 ns, we can allocate 5 (0.5) handlers to ns1, 
5 (0.5) handlers to ns2, and 

[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800259#comment-17800259
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

ayushtkn commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435987418


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) throws IOException {
         throw e;
       }
     } else {
+      if (args.size() > 0) {
+        System.err.println(getName() + ": Too many arguments");

Review Comment:
   we can change but it won't be consistent with others
   
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java#L115-L117








[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
h2. Current shortcomings

[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total and how many nameservices it currently serves, and then 
carefully calculate how many handlers to allocate to each ns so that the sum 
across all ns does not exceed the router's total, while also considering how 
many handlers each ns needs for good performance. I therefore need to be very 
careful when configuring: if even one ns is given one handler too many, so that 
the total exceeds the number of handlers the router owns, the router will fail 
to start. I then have to investigate why the router failed to start and, after 
finding the reason, reconsider the handler count for each ns.

2. *Extending ns is not supported*: While the router is running, if a new ns is 
added to the cluster and a mount is added for it, the ns cannot be accessed 
through the router, because no handlers have been allocated to it. We must 
reconfigure the handler counts and then refresh the configuration; only then 
can the router access the ns normally. And when we reconfigure the handler 
counts, we again face shortcoming 1: configuration is inconvenient and 
error-prone.

3. *Wasted handlers*: The main purpose of proposing 
RouterRpcFairnessPolicyController is to let the router access ns under normal 
load without being affected by ns under higher load. First, not all ns have 
high loads; second, ns with high loads are not highly loaded 24 hours a day. It 
may be that only certain periods, such as 00:00 to 08:00, are busy, while other 
periods carry normal load. Assume there are 2 ns and each is allocated half of 
the handlers. If ns1 has many requests from 00:00 to 14:00 and almost none from 
14:00 to 24:00, while ns2 has many requests from 12:00 to 24:00 and almost none 
from 00:00 to 14:00, then between 00:00 and 12:00 and between 14:00 and 24:00 
only one ns has many requests while the other has almost none, so half of the 
handlers are wasted.

4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController 
does not support sharing, only isolation. Isolation is just a means to improve 
the router's access to ns under normal load, not the goal. It is impossible for 
all ns in a cluster to have high loads; on the contrary, in most scenarios only 
a few ns have high loads while the loads of most other ns are normal. For ns 
with higher load and ns with normal load, we need to isolate their handlers so 
that the higher-load ns does not affect the performance of the lower-load ns. 
However, nameservices that are all under normal load, or all under higher load, 
do not need to be isolated from each other; ns of the same nature can share the 
router's handlers. This performs better than assigning a fixed number of 
handlers to each ns, because each ns can use all the handlers of the router.


h2. New features
As described above, StaticRouterRpcFairnessPolicyController has deficiencies in 
both usability and performance. I provide a new RouterRpcFairnessPolicyController, 
ProportionRouterRpcFairnessPolicyController (maybe with a better name), to 
address these major shortcomings.


1. *More user-friendly configuration*: supports allocating handlers to each ns 
proportionally. For example, we can give ns1 a handler ratio of 0.2; ns1 will 
then use 0.2 of the router's total handlers. With this method, we do not need 
to know in advance how many handlers the router has.

2. *Sharing and isolation*: Sharing is as important as isolation. We allow the 
sum of handlers across all ns to exceed the router's total. For example, 
assuming we have 10 handlers and 3 ns, we can allocate 5 (0.5) handlers to ns1, 
5 (0.5) to ns2, and 5 (0.5) to ns3. This feature is very important. Consider 
the following scenarios:
- Only one ns is busy during a period of time: Assume that ns1 has more 
requests from 0 to 8 


[jira] [Commented] (HDFS-17300) [SBN READ] Observer should throw ObserverRetryOnActiveException if stateid is always delayed with Active Namenode for a configured time

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800257#comment-17800257
 ] 

ASF GitHub Bot commented on HDFS-17300:
---

hadoop-yetus commented on PR #6376:
URL: https://github.com/apache/hadoop/pull/6376#issuecomment-1868814964

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 21s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 51s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  19m  1s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   8m 14s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   7m 26s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   2m  1s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 11s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  20m 47s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 21s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   1m  7s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 51s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   7m 51s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   7m 22s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   7m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m 55s | 
[/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6376/2/artifact/out/results-checkstyle-root.txt)
 |  root: The patch generated 3 new + 267 unchanged - 0 fixed = 270 total (was 
267)  |
   | +1 :green_heart: |  mvnsite  |   1m 42s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 20s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  20m 45s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  15m 47s |  |  hadoop-common in the patch 
passed.  |
   | -1 :x: |  unit  | 200m 54s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6376/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 38s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 345m 51s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   |   | hadoop.hdfs.TestDFSStripedOutputStream |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6376/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6376 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
   | uname | Linux 59544ada24be 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
h2. Current shortcomings

[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use StaticRouterRpcFairnessPolicyController, I first need to know how many handlers the router has in total, then how many nameservices the router currently serves, and then carefully calculate how many handlers to allocate to each ns so that the sum across all ns does not exceed the router's total, while also considering how many handlers each ns needs for good performance. Therefore, I need to be very careful when configuring. Even one extra handler for a single ns, pushing the total past the number of handlers the router owns, will cause the router to fail to start. I then have to investigate why the router failed to start and, after finding the reason, reconsider the number of handlers for each ns.
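A minimal sketch of this failure mode (all names here are invented for illustration; this is not the router's actual startup code): a static controller has to reject any configuration whose per-ns handler counts sum past the router's total, so even one extra handler aborts startup.

```java
import java.util.Map;

public class StaticAllocationCheckSketch {
  // Reject a static per-ns handler configuration whose sum exceeds the
  // router's total handler count, mirroring the startup failure described above.
  static void validate(int routerHandlers, Map<String, Integer> perNs) {
    int sum = perNs.values().stream().mapToInt(Integer::intValue).sum();
    if (sum > routerHandlers) {
      throw new IllegalArgumentException(
          "Configured handlers (" + sum + ") exceed router total (" + routerHandlers + ")");
    }
  }

  public static void main(String[] args) {
    validate(100, Map.of("ns1", 50, "ns2", 50));  // OK: exactly the total
    validate(100, Map.of("ns1", 51, "ns2", 50));  // throws: one handler too many
  }
}
```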

2. *Extending ns is not supported*: While the router is running, if a new ns is added to the cluster and a mount is created for it, the ns cannot be accessed through the router because no handlers have been allocated to it. We must reconfigure the handler counts and then refresh the configuration before the router can access the ns normally. And when we reconfigure the handler counts, we again face disadvantage 1: configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of proposing RouterRpcFairnessPolicyController is to let the router keep serving ns with normal load without being affected by ns with higher load. First of all, not all ns have high loads; secondly, ns with high loads are not highly loaded 24 hours a day. It may be that only certain periods, such as 00:00 to 08:00, are busy, while other periods carry normal load. Assume there are 2 ns, each allocated half of the handlers. If ns1 has many requests from 00:00 to 14:00 and almost none from 14:00 to 24:00, while ns2 has many requests from 12:00 to 24:00 and almost none from 00:00 to 14:00, then between 00:00 and 12:00 and between 14:00 and 24:00 only one ns is busy while the other is almost idle, so half of the handlers are wasted.

4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController does not support sharing, only isolation. I think isolation is just a means to improve the router's performance when accessing ns under normal load, not the goal itself. It is unlikely that all ns in the cluster have high loads; on the contrary, in most scenarios only a few ns have high loads while most others carry normal load. Between a highly loaded ns and a normally loaded ns we need to isolate handlers, so that the busy ns does not hurt the performance of the quieter one. However, nameservices that are all under normal load, or all under high load, do not need to be isolated from each other; ns of the same nature can share the router's handlers. This performs better than assigning a fixed number of handlers to each ns, because each ns can draw on all of the router's handlers.
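The sharing idea can be illustrated with plain JDK semaphores (an illustrative sketch, not the patch code): when each ns is granted an oversubscribed share of permits, a busy ns can use its full share even while the others sit idle, instead of stranding handlers behind a fixed partition.

```java
import java.util.concurrent.Semaphore;

public class SharedPermitsSketch {
  public static void main(String[] args) {
    int totalHandlers = 10;
    // Each ns is granted 0.5 of the total; the sum (5 + 5 + 5) may exceed 10,
    // because in practice the nameservices are rarely all busy at once.
    Semaphore ns1 = new Semaphore((int) (totalHandlers * 0.5));
    Semaphore ns2 = new Semaphore((int) (totalHandlers * 0.5));
    Semaphore ns3 = new Semaphore((int) (totalHandlers * 0.5));

    // A burst to ns1 while ns2 and ns3 are idle: ns1 can still use its full share.
    boolean acquired = ns1.tryAcquire(5);
    System.out.println("ns1 acquired 5 permits: " + acquired);
    ns1.release(5);
  }
}
```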


h2. New features
As described above, the StaticRouterRpcFairnessPolicyController has deficiencies in usability and performance. I provide a new RouterRpcFairnessPolicyController, ProportionRouterRpcFairnessPolicyController (perhaps with a better name), to address those major shortcomings.


1. *More user-friendly configuration*: Supports allocating handlers proportionally to each ns. For example, if we give ns1 a handler ratio of 0.2, then ns1 will use 20% of the total number of handlers on the router. With this method, we do not need to know in advance how many handlers the router has.
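A minimal sketch of the proportional allocation (class and method names are invented here; the patch may compute this differently): each ns receives floor(totalHandlers * ratio) permits, so the operator only tunes ratios and never the absolute handler count.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProportionAllocationSketch {
  // Turn per-ns ratios into permit counts against the router's handler total.
  static Map<String, Integer> allocate(int totalHandlers, Map<String, Double> proportions) {
    Map<String, Integer> permits = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : proportions.entrySet()) {
      // At least one permit so the ns stays reachable even for tiny ratios.
      int n = Math.max(1, (int) (totalHandlers * e.getValue()));
      permits.put(e.getKey(), n);
    }
    return permits;
  }

  public static void main(String[] args) {
    Map<String, Double> ratios = new LinkedHashMap<>();
    ratios.put("ns1", 0.2);
    ratios.put("ns2", 0.5);
    System.out.println(allocate(100, ratios)); // {ns1=20, ns2=50}
  }
}
```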

2. *Sharing and isolation*: Sharing is as important as isolation. We allow the sum of the handlers allocated to all ns to exceed the total number of handlers. For example, assuming we have 10 handlers and 3 ns, we can allocate 5 handlers (0.5) to ns1, 5 handlers (0.5) to ns2, and 5 handlers (0.5) to ns3. This feature is very important. Consider the following scenarios:
- Only one ns is busy during a period of time: Assume that ns1 has more 
requests from 0 to 8 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Summary: RBF: ProportionRouterRpcFairnessPolicyController-Sharing and 
isolation.  (was: RBF: ProportionRouterRpcFairnessPolicyController-Sharing and 
Isolating.)

> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Waste handlers*:  The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are also under normal load, or are under higher load, we do not need to 
> isolate them, these ns of the same nature can share the handlers of the 
> router; The performance is better than assigning a fixed number of handlers 
> to each ns, because each ns can use all the handlers of the router.
> h2. New features
> Based on the above staticRouterRpcFairnessPolicyController, there are 
> deficiencies in usage and performance. I provide a new 
> RouterRpcFairnessPolicyController: 
> ProportionRouterRpcFairnessPolicyController (maybe with a better name) 

[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800255#comment-17800255
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

huangzhaobo99 commented on code in PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#discussion_r1435977171


##
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/fairness/ProportionRouterRpcFairnessPolicyController.java:
##
@@ -0,0 +1,76 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.federation.fairness;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hdfs.server.federation.router.FederationUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import java.util.Set;
+
+import static 
org.apache.hadoop.hdfs.server.federation.fairness.RouterRpcFairnessConstants.CONCURRENT_NS;
+import static 
org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_FAIR_HANDLER_PROPORTION_DEFAULT;
+import static 
org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_FAIR_HANDLER_PROPORTION_KEY_PREFIX;
+import static 
org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_HANDLER_COUNT_DEFAULT;
+import static 
org.apache.hadoop.hdfs.server.federation.router.RBFConfigKeys.DFS_ROUTER_HANDLER_COUNT_KEY;
+
+/**
+ * Proportion fairness policy extending
+ * {@link AbstractRouterRpcFairnessPolicyController}. It fetches the
+ * proportion of handlers from configuration for all available name services
+ * and, based on the proportion and the total number of handlers, calculates
+ * the handler count for each ns. The handler count will not change for this
+ * controller.
+ */
+public class ProportionRouterRpcFairnessPolicyController extends
+    AbstractRouterRpcFairnessPolicyController {
+
+  private static final Logger LOG =
+      LoggerFactory.getLogger(ProportionRouterRpcFairnessPolicyController.class);
+
+  public ProportionRouterRpcFairnessPolicyController(Configuration conf) {
+    init(conf);
+  }
+
+  @Override
+  public void init(Configuration conf) {
+    super.init(conf);
+    // Total handlers configured to process all incoming RPCs.
+    int handlerCount = conf.getInt(DFS_ROUTER_HANDLER_COUNT_KEY,
+        DFS_ROUTER_HANDLER_COUNT_DEFAULT);
+
+    LOG.info("Handlers available for fairness assignment {} ", handlerCount);
+
+    // Get all name services configured.
+    Set<String> allConfiguredNS = FederationUtil.getAllConfiguredNS(conf);
+
+    // Insert the concurrent nameservice into the set to process together.
+    allConfiguredNS.add(CONCURRENT_NS);
+    for (String nsId : allConfiguredNS) {

Review Comment:
   From the allocation perspective, this is basically consistent with the StaticRouterRpcFairnessPolicyController policy. Can you share your ideas? Thanks.





> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and isolation.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-Sharing and Isolating.

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Summary: RBF: ProportionRouterRpcFairnessPolicyController-Sharing and 
Isolating.  (was: RBF: ProportionRouterRpcFairnessPolicyController-support 
proportional allocation of semaphores)

> RBF: ProportionRouterRpcFairnessPolicyController-Sharing and Isolating.
> ---
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> h2. Current shortcomings
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. *Configuration is inconvenient and error-prone*: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns.
> 2. *Extension ns is not supported*: During the running of the router, if a 
> new ns is added to the cluster and a mount is added for the ns, but because 
> no handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 3. *Waste handlers*:  The main purpose of proposing 
> RouterRpcFairnessPolicyController is to enable the router to access ns with 
> normal load and not be affected by ns with higher load. First of all, not all 
> ns have high loads; secondly, ns with high loads do not have high loads 24 
> hours a day. It may be that only certain time periods, such as 0 to 8 
> o'clock, have high loads, and other time periods have normal loads. Assume 
> there are 2 ns, and each ns is allocated half of the number of handlers. 
> Assume that ns1 has many requests from 0 to 14 o'clock, and almost no 
> requests from 14 to 24 o'clock, ns2 has many requests from 12 to 24 o'clock, 
> and almost no requests from 0 to 14 o'clock; when it is between 0 o'clock and 
> 12 o'clock and between 14 o'clock and 24 o'clock, only one ns has more 
> requests and the other ns has almost no requests, so we have wasted half of 
> the number of handlers.
> 4. *Only isolation, no sharing*: The staticRouterRpcFairnessPolicyController 
> does not support sharing, only isolation. I think isolation is just a means 
> to improve the performance of router access to normal ns, not the purpose. It 
> is impossible for all ns in the cluster to have high loads. On the contrary, 
> in most scenarios, only a few ns in the cluster have high loads, and the 
> loads of most other ns are normal. For ns with higher load and ns with normal 
> load, we need to isolate their handlers so that the ns with higher load will 
> not affect the performance of ns with lower load. However, for nameservices 
> that are also under normal load, or are under higher load, we do not need to 
> isolate them, these ns of the same nature can share the handlers of the 
> router; The performance is better than assigning a fixed number of handlers 
> to each ns, because each ns can use all the handlers of the router.
> h2. New features
> Based on the above staticRouterRpcFairnessPolicyController, there are 
> deficiencies in usage and performance. I provide a new 
> RouterRpcFairnessPolicyController: 
> ProportionRouterRpcFairnessPolicyController 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
h2. Current shortcomings

[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use StaticRouterRpcFairnessPolicyController, I first need to know how many handlers the router has in total, then how many nameservices the router currently serves, and then carefully calculate how many handlers to allocate to each ns so that the sum across all ns does not exceed the router's total, while also considering how many handlers each ns needs for good performance. Therefore, I need to be very careful when configuring. Even one extra handler for a single ns, pushing the total past the number of handlers the router owns, will cause the router to fail to start. I then have to investigate why the router failed to start and, after finding the reason, reconsider the number of handlers for each ns.

2. *Extending ns is not supported*: While the router is running, if a new ns is added to the cluster and a mount is created for it, the ns cannot be accessed through the router because no handlers have been allocated to it. We must reconfigure the handler counts and then refresh the configuration before the router can access the ns normally. And when we reconfigure the handler counts, we again face disadvantage 1: configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of proposing RouterRpcFairnessPolicyController is to let the router keep serving ns with normal load without being affected by ns with higher load. First of all, not all ns have high loads; secondly, ns with high loads are not highly loaded 24 hours a day. It may be that only certain periods, such as 00:00 to 08:00, are busy, while other periods carry normal load. Assume there are 2 ns, each allocated half of the handlers. If ns1 has many requests from 00:00 to 14:00 and almost none from 14:00 to 24:00, while ns2 has many requests from 12:00 to 24:00 and almost none from 00:00 to 14:00, then between 00:00 and 12:00 and between 14:00 and 24:00 only one ns is busy while the other is almost idle, so half of the handlers are wasted.

4. *Only isolation, no sharing*: The StaticRouterRpcFairnessPolicyController does not support sharing, only isolation. I think isolation is just a means to improve the router's performance when accessing ns under normal load, not the goal itself. It is unlikely that all ns in the cluster have high loads; on the contrary, in most scenarios only a few ns have high loads while most others carry normal load. Between a highly loaded ns and a normally loaded ns we need to isolate handlers, so that the busy ns does not hurt the performance of the quieter one. However, nameservices that are all under normal load, or all under high load, do not need to be isolated from each other; ns of the same nature can share the router's handlers. This performs better than assigning a fixed number of handlers to each ns, because each ns can draw on all of the router's handlers.


h2. New features
As described above, the StaticRouterRpcFairnessPolicyController has deficiencies in usability and performance. I provide a new RouterRpcFairnessPolicyController, ProportionRouterRpcFairnessPolicyController (perhaps with a better name), to address those major shortcomings.


1. *More user-friendly configuration*: Supports allocating handlers proportionally to each ns. For example, if we give ns1 a handler ratio of 0.2, then ns1 will use 20% of the total number of handlers on the router. With this method, we do not need to know in advance how many handlers the router has.

2. *Sharing*:




  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use StaticRouterRpcFairnessPolicyController, I first need to know how many handlers the router has in total, then how many nameservices the router currently serves, and then carefully calculate how many handlers to allocate to each ns so that the sum across all ns does not exceed the router's total, while also considering how many handlers each ns needs for good performance. Therefore, I need to be very careful when configuring. Even one extra handler for a single ns, pushing the total past the number of handlers the router owns, will cause the router to fail to start. I then have to investigate why the router failed to start and, after finding the reason, reconsider the number of handlers for each ns.

2. *Extending ns is not supported*: While the router is running, if a new ns is added to the cluster and a mount is created for it, the ns cannot be accessed through the router because no handlers have been allocated to it. We must reconfigure the handler counts and then refresh the configuration before the router can access the ns normally. And when we reconfigure the handler counts, we again face disadvantage 1: configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of proposing RouterRpcFairnessPolicyController is to let the router keep serving ns with normal load without being affected by ns with higher load. First of all, not all ns have high loads; secondly, ns with high loads are not highly loaded 24 hours a day. It may be that only certain periods, such as 00:00 to 08:00, are busy, while other periods carry normal load. Assume there are 2 ns, each allocated half of the handlers. If ns1 has many requests from 00:00 to 14:00 and almost none from 14:00 to 24:00, while ns2 has many requests from 12:00 to 24:00 and almost none from 00:00 to 14:00, then between 00:00 and 12:00 and between 14:00 and 24:00 only one ns is busy while the other is almost idle, so half of the handlers are wasted.

4. *Isolation only, no sharing*: StaticRouterRpcFairnessPolicyController 
supports only isolation, not sharing. Isolation is just a means of keeping 
router access to normally loaded ns fast, not an end in itself. It is 
unlikely that every ns in a cluster is heavily loaded; in most scenarios only 
a few ns are, while most others carry normal load. Handlers for heavily 
loaded and normally loaded ns should be isolated so that the former do not 
hurt access to the latter, but nameservices that are all under normal load, 
or all under high load, do not need to be isolated from one another: ns of 
the same kind can share the router's handlers. This performs better than 
assigning each ns a fixed number of handlers, because each ns can draw on all 
of the router's handlers.


Because StaticRouterRpcFairnessPolicyController falls short in both usability 
and performance as described above, I provide a new 
RouterRpcFairnessPolicyController, ProportionRouterRpcFairnessPolicyController 
(perhaps with a better name), to address these major shortcomings.


1. *More user-friendly configuration*: supports allocating handlers to each 
ns proportionally.
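As a rough illustration of the proportional-allocation idea (a minimal sketch; the class and method names below are hypothetical, not the actual ProportionRouterRpcFairnessPolicyController API), each ns's permit count could be derived from a configured fraction of the router's total handlers:

```java
// Hypothetical sketch of proportional handler allocation, assuming each ns is
// configured with a fraction of the router's total handlers. Fractions need
// not sum to 1.0, which is what allows nameservices to share capacity.
public class ProportionalAllocSketch {

  // Compute the permit count for each ns from its configured proportion.
  public static int[] allocate(int totalHandlers, double[] proportions) {
    int[] permits = new int[proportions.length];
    for (int i = 0; i < proportions.length; i++) {
      // Guarantee at least one permit so a newly mounted ns is never starved.
      permits[i] = Math.max(1, (int) (totalHandlers * proportions[i]));
    }
    return permits;
  }

  public static void main(String[] args) {
    // 100 handlers split 50%/30%/20% across three nameservices.
    System.out.println(java.util.Arrays.toString(
        allocate(100, new double[]{0.5, 0.3, 0.2})));  // [50, 30, 20]
  }
}
```

Under a scheme like this, adding a new ns only requires giving it a proportion; nothing has to be rebalanced by hand to keep a global sum below the router's handler count.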




  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: To use 
StaticRouterRpcFairnessPolicyController, I first need to know how many 
handlers the router has in total and how many nameservices it currently 
serves, then carefully calculate how many handlers to allocate to each ns so 
that the sum across all ns does not exceed the router's total, while also 
considering how many handlers each ns needs for good performance. 
Configuration therefore demands great care: allocating even one handler too 
many to a single ns pushes the total past the router's handler count and the 
router fails to start, after which I have to diagnose the startup failure and 
then recompute the handler count for every ns.

2. *Adding a new ns is not supported*: While the router is running, if a new 
ns is added to the cluster and a mount is created for it, the ns cannot be 
accessed through the router because no handlers are allocated to it. We must 
reconfigure the handler counts and then refresh the configuration before the 
router can access the ns, which again exposes us to disadvantage 1: 
configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of RouterRpcFairnessPolicyController 
is to let the router access normally loaded ns without being affected by 
heavily loaded ns. But not all ns are heavily loaded, and even heavily loaded 
ns are not busy 24 hours a day; load may be high only during certain periods, 
such as 0:00 to 8:00, and normal the rest of the time. Suppose there are 2 
ns, each allocated half of the handlers: ns1 receives many requests from 0:00 
to 14:00 and almost none from 14:00 to 24:00, while ns2 receives many 
requests from 12:00 to 24:00 and almost none from 0:00 to 12:00. Then between 
0:00 and 12:00, and again between 14:00 and 24:00, only one ns is busy while 
the other is nearly idle, so half of the handlers are wasted.

4. *Isolation only, no sharing*: StaticRouterRpcFairnessPolicyController 
supports only isolation, not sharing. Isolation is just a means of keeping 
router access to normally loaded ns fast, not an end in itself. It is 
unlikely that every ns in a cluster is heavily loaded; in most scenarios only 
a few ns are, while most others carry normal load. Handlers for heavily 
loaded and normally loaded ns should be isolated so that the former do not 
hurt access to the latter, but nameservices that are all under normal load, 
or all under high load, do not need to be isolated from one another: ns of 
the same kind can share the router's handlers. This performs better than 
assigning each ns a fixed number of handlers, because each ns can draw on all 
of the router's handlers.


Because StaticRouterRpcFairnessPolicyController falls short in both usability 
and performance as described above, I provide a new 
RouterRpcFairnessPolicyController, ProportionRouterRpcFairnessPolicyController 
(perhaps with a better name), to address these major shortcomings.






  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: To use 
StaticRouterRpcFairnessPolicyController, I first need to know how many 
handlers the router has in total and how many nameservices it currently 
serves, then carefully calculate how many handlers to allocate to each ns so 
that the sum across all ns does not exceed the router's total, while also 
considering how many handlers each ns needs for good performance. 
Configuration therefore demands great care: allocating even one handler too 
many to a single ns pushes the total past the router's handler count and the 
router fails to start, after which I have to diagnose the startup failure and 
then recompute the handler count for every ns.

2. *Adding a new ns is not supported*: While the router is running, if a new 
ns is added to the cluster and a mount is created for it, the ns cannot be 
accessed through the router because no handlers are allocated to it. We must 
reconfigure the handler counts and then refresh the configuration before the 
router can access the ns, which again exposes us to disadvantage 1: 
configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of RouterRpcFairnessPolicyController 
is to let the router access normally loaded ns without being affected by 
heavily loaded ns. But not all ns are heavily loaded, and even heavily loaded 
ns are not busy 24 hours a day; load may be high only during certain periods, 
such as 0:00 to 8:00, and normal the rest of the time. Suppose there are 2 
ns, each allocated half of the handlers: ns1 receives many requests from 0:00 
to 14:00 and almost none from 14:00 to 24:00, while ns2 receives many 
requests from 12:00 to 24:00 and almost none from 0:00 to 12:00. Then between 
0:00 and 12:00, and again between 14:00 and 24:00, only one ns is busy while 
the other is nearly idle, so half of the handlers are wasted.

4. *Isolation only, no sharing*: StaticRouterRpcFairnessPolicyController 
supports only isolation, not sharing. Isolation is just a means of keeping 
router access to normally loaded ns fast, not an end in itself. It is 
unlikely that every ns in a cluster is heavily loaded; in most scenarios only 
a few ns are, while most others carry normal load. Handlers for heavily 
loaded and normally loaded ns should be isolated so that the former do not 
hurt access to the latter, but nameservices that are all under normal load, 
or all under high load, do not need to be isolated from one another: ns of 
the same kind can share the router's handlers. This performs better than 
assigning each ns a fixed number of handlers, because each ns can draw on all 
of the router's handlers.


As described above, StaticRouterRpcFairnessPolicyController has deficiencies 
in both usability and performance.




  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be 

[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800252#comment-17800252
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435970897


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args)
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   Thx, @haiyang1987, No problem! 
   Changing the error message:
   ```java
   System.err.println(getName() + ": Input invalid arguments.\nUsage: " + 
getLongUsage());
   ```
   How about it? cc @ayushtkn 





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup takes -policy followed by the names of policies; 
> otherwise it defaults to all enabled policies.
> If there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments error.
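A sketch of the missing check (hypothetical helper names; `getName()`/`getLongUsage()` stand in for the real ECAdmin command methods): leftover arguments should produce an error plus usage instead of being silently ignored:

```java
import java.util.List;

// Hedged sketch of the leftover-argument check discussed in the review;
// getName()/getLongUsage() are stand-ins for the real ECAdmin command methods.
public class VerifyClusterSetupArgsSketch {
  static String getName() { return "-verifyClusterSetup"; }
  static String getLongUsage() {
    return "[-verifyClusterSetup [-policy ...<policy>...]]";
  }

  // Returns an error message when unexpected arguments remain, or null when
  // the argument list is valid: fail loudly instead of ignoring extras.
  public static String checkLeftoverArgs(List<String> args) {
    if (!args.isEmpty()) {
      return getName() + ": Too many arguments.\nUsage: " + getLongUsage();
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(checkLeftoverArgs(List.of("XOR-2-1-1024k")));
  }
}
```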



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: To use 
StaticRouterRpcFairnessPolicyController, I first need to know how many 
handlers the router has in total and how many nameservices it currently 
serves, then carefully calculate how many handlers to allocate to each ns so 
that the sum across all ns does not exceed the router's total, while also 
considering how many handlers each ns needs for good performance. 
Configuration therefore demands great care: allocating even one handler too 
many to a single ns pushes the total past the router's handler count and the 
router fails to start, after which I have to diagnose the startup failure and 
then recompute the handler count for every ns.

2. *Adding a new ns is not supported*: While the router is running, if a new 
ns is added to the cluster and a mount is created for it, the ns cannot be 
accessed through the router because no handlers are allocated to it. We must 
reconfigure the handler counts and then refresh the configuration before the 
router can access the ns, which again exposes us to disadvantage 1: 
configuration is inconvenient and error-prone.

3. *Wasted handlers*: The main purpose of RouterRpcFairnessPolicyController 
is to let the router access normally loaded ns without being affected by 
heavily loaded ns. But not all ns are heavily loaded, and even heavily loaded 
ns are not busy 24 hours a day; load may be high only during certain periods, 
such as 0:00 to 8:00, and normal the rest of the time. Suppose there are 2 
ns, each allocated half of the handlers: ns1 receives many requests from 0:00 
to 14:00 and almost none from 14:00 to 24:00, while ns2 receives many 
requests from 12:00 to 24:00 and almost none from 0:00 to 12:00. Then between 
0:00 and 12:00, and again between 14:00 and 24:00, only one ns is busy while 
the other is nearly idle, so half of the handlers are wasted.

4. *Isolation only, no sharing*: StaticRouterRpcFairnessPolicyController 
supports only isolation, not sharing. Isolation is just a means of keeping 
router access to normally loaded ns fast, not an end in itself. It is 
unlikely that every ns in a cluster is heavily loaded; in most scenarios only 
a few ns are, while most others carry normal load. Handlers for heavily 
loaded and normally loaded ns should be isolated so that the former do not 
hurt access to the latter, but nameservices that are all under normal load, 
or all under high load, do not need to be isolated from one another: ns of 
the same kind can share the router's handlers. This performs better than 
assigning each ns a fixed number of handlers, because each ns can draw on all 
of the router's handlers.





  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is 

[jira] [Updated] (HDFS-17298) Fix NPE in DataNode.handleBadBlock and BlockSender

2023-12-24 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17298:
--
Component/s: datanode

> Fix NPE in DataNode.handleBadBlock and BlockSender
> --
>
> Key: HDFS-17298
> URL: https://issues.apache.org/jira/browse/HDFS-17298
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> There are some NPE issues on the DataNode side of our online environment.
> The detailed exception information is
> {code:java}
> 2023-12-20 13:58:25,449 ERROR datanode.DataNode (DataXceiver.java:run(330)) 
> [DataXceiver for client DFSClient_NONMAPREDUCE_xxx at /xxx:41452 [Sending 
> block BP-xxx:blk_xxx]] - xxx:50010:DataXceiver error processing READ_BLOCK 
> operation  src: /xxx:41452 dst: /xxx:50010
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:301)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> NPE Code logic:
> {code:java}
> if (!fromScanner && blockScanner.isEnabled()) {
>   // data.getVolume(block) is null
>   blockScanner.markSuspectBlock(data.getVolume(block).getStorageID(),
>   block);
> } 
> {code}
> {code:java}
> 2023-12-20 13:52:18,844 ERROR datanode.DataNode (DataXceiver.java:run(330)) 
> [DataXceiver for client /xxx:61052 [Copying block BP-xxx:blk_xxx]] - 
> xxx:50010:DataXceiver error processing COPY_BLOCK operation  src: /xxx:61052 
> dst: /xxx:50010
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.handleBadBlock(DataNode.java:4045)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.copyBlock(DataXceiver.java:1163)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opCopyBlock(Receiver.java:291)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:113)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> NPE Code logic:
> {code:java}
> // Obtain a reference before reading data
> volumeRef = datanode.data.getVolume(block).obtainReference(); 
> //datanode.data.getVolume(block) is null  
> {code}
> We need to fix it.
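One plausible shape of the fix (a sketch only; `Volume`/`getVolume` below are hypothetical stand-ins for `FsVolumeSpi` and `FsDatasetSpi#getVolume`, not the actual HDFS patch): resolve the volume once and handle null, rather than chaining the dereference:

```java
import java.util.Optional;

// Hedged sketch of the null guard the issue calls for. Volume/getVolume are
// hypothetical stand-ins for FsVolumeSpi and FsDatasetSpi#getVolume, which
// can return null when the replica was removed concurrently.
public class NullGuardSketch {

  static final class Volume {
    private final String storageId;
    Volume(String storageId) { this.storageId = storageId; }
    String getStorageID() { return storageId; }
  }

  // Stand-in for datanode.data.getVolume(block): may return null.
  static Volume getVolume(String block) {
    return block.startsWith("blk_") ? new Volume("DS-1") : null;
  }

  // Instead of getVolume(block).getStorageID() (NPE when the volume is gone),
  // resolve the volume first and report absence to the caller, who can log a
  // warning and skip the block rather than crash the DataXceiver thread.
  public static Optional<String> suspectStorageId(String block) {
    Volume v = getVolume(block);
    return v == null ? Optional.empty() : Optional.of(v.getStorageID());
  }

  public static void main(String[] args) {
    System.out.println(suspectStorageId("blk_1024")); // Optional[DS-1]
    System.out.println(suspectStorageId("removed"));  // Optional.empty
  }
}
```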



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17297) The NameNode should remove block from the BlocksMap if the block is marked as deleted.

2023-12-24 Thread Haiyang Hu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haiyang Hu updated HDFS-17297:
--
Component/s: namenode

> The NameNode should remove block from the BlocksMap if the block is marked as 
> deleted.
> --
>
> Key: HDFS-17297
> URL: https://issues.apache.org/jira/browse/HDFS-17297
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Haiyang Hu
>Assignee: Haiyang Hu
>Priority: Major
>  Labels: pull-request-available
>
> When call internalReleaseLease method:
> {code:java}
> boolean internalReleaseLease(
> ...
> int minLocationsNum = 1;
> if (lastBlock.isStriped()) {
>   minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
> }
> if (uc.getNumExpectedLocations() < minLocationsNum &&
> lastBlock.getNumBytes() == 0) {
>   // There is no datanode reported to this block.
>   // may be client have crashed before writing data to pipeline.
>   // This blocks doesn't need any recovery.
>   // We can remove this block and close the file.
>   pendingFile.removeLastBlock(lastBlock);
>   finalizeINodeFileUnderConstruction(src, pendingFile,
>   iip.getLatestSnapshotId(), false); 
> ...
> }
> {code}
>  If the condition `uc.getNumExpectedLocations() < minLocationsNum && 
> lastBlock.getNumBytes() == 0` is met during the UNDER_RECOVERY logic, the 
> block is removed from the inode file's block list and marked as deleted.
> However, it is not removed from the BlocksMap, which may cause a memory leak.
> Therefore it is necessary to remove the block from the BlocksMap at this 
> point as well.
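The leak can be sketched with a toy model (hypothetical classes, not the NameNode's actual `BlocksMap` or INode structures): removing the last block from the file alone leaves its map entry behind, while the proposed fix removes it from both places:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the leak described above (hypothetical classes, not the real
// NameNode structures): a global block map plus a per-file block list.
public class BlocksMapLeakSketch {
  final Map<Long, String> blocksMap = new HashMap<>(); // blockId -> owning file
  final List<Long> fileBlocks = new ArrayList<>();     // the INode's block list

  void addBlock(long id, String file) {
    blocksMap.put(id, file);
    fileBlocks.add(id);
  }

  // Current behavior per the issue: the block leaves the file's list, but its
  // blocksMap entry survives, so the entry is leaked.
  void removeLastBlockLeaky() {
    fileBlocks.remove(fileBlocks.size() - 1);
  }

  // Proposed behavior: drop the block from the BlocksMap as well.
  void removeLastBlockFixed() {
    long id = fileBlocks.remove(fileBlocks.size() - 1);
    blocksMap.remove(id);
  }

  public static void main(String[] args) {
    BlocksMapLeakSketch s = new BlocksMapLeakSketch();
    s.addBlock(1L, "/file");
    s.removeLastBlockFixed();
    System.out.println(s.blocksMap.isEmpty()); // true: no leaked entry
  }
}
```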



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800250#comment-17800250
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

hadoop-yetus commented on PR #6380:
URL: https://github.com/apache/hadoop/pull/6380#issuecomment-1868772311

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  17m 10s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m 41s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 41s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 22s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  38m  2s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | -0 :warning: |  patch  |  38m 23s |  |  Used diff version of patch file. 
Binary files and potentially other changes not applied. Please rebase and 
squash commits if necessary.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 18s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 0 
unchanged - 0 fixed = 1 total (was 0)  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 28s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  37m 43s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  23m  6s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 34s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 176m 57s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6380 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 9178f4c3fcd6 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 10a9459725fc05183b95fb4917f679a45cbe1bc7 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6380/1/testReport/ |
   | Max. process+thread count | 2399 (vs. ulimit of 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: To use 
StaticRouterRpcFairnessPolicyController, I first need to know how many 
handlers the router has in total and how many nameservices it currently 
serves, then carefully calculate how many handlers to allocate to each ns so 
that the sum across all ns does not exceed the router's total, while also 
considering how many handlers each ns needs for good performance. 
Configuration therefore demands great care: allocating even one handler too 
many to a single ns pushes the total past the router's handler count and the 
router fails to start, after which I have to diagnose the startup failure and 
then recompute the handler count for every ns.

2. *Adding a new ns is not supported*: While the router is running, if a new 
ns is added to the cluster and a mount is created for it, the ns cannot be 
accessed through the router because no handlers are allocated to it. We must 
reconfigure the handler counts and then refresh the configuration before the 
router can access the ns, which again exposes us to disadvantage 1: 
configuration is inconvenient and error-prone.

3. *Waste handlers*: The main purpose of proposing 
RouterRpcFairnessPolicyController is to let the router keep serving ns with 
normal load without being affected by ns with higher load. First, not all ns 
have high load; second, a heavily loaded ns is not busy 24 hours a day. It may 
be busy only during certain periods, such as 0:00 to 8:00, and lightly loaded 
the rest of the time. Assume there are 2 ns, each allocated half of the 
handlers. If ns1 receives many requests from 0:00 to 14:00 and almost none 
from 14:00 to 24:00, while ns2 receives many requests from 12:00 to 24:00 and 
almost none from 0:00 to 12:00, then outside the 12:00-14:00 overlap only one 
ns is busy while the other is nearly idle, so half of the handlers are wasted.
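One possible direction, sketched below purely as an illustration (this is not the actual HDFS-17302 patch; the class and method names are invented): derive each ns's permit count from a configured proportion of the total handlers, so allocations are expressed as ratios rather than hand-computed absolute counts.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: per-ns permits derived from configured proportions. */
public class ProportionalPermits {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();

    /** fractions, e.g. ns1 -> 0.5; each is applied to the total handler count. */
    ProportionalPermits(int totalHandlers, Map<String, Double> fractions) {
        fractions.forEach((ns, f) ->
                permits.put(ns, new Semaphore((int) (totalHandlers * f))));
    }

    /** Non-blocking acquire of one handler permit for the given ns. */
    boolean tryAcquire(String ns) {
        Semaphore s = permits.get(ns);
        return s != null && s.tryAcquire();
    }

    void release(String ns) {
        permits.get(ns).release();
    }

    int available(String ns) {
        return permits.get(ns).availablePermits();
    }

    public static void main(String[] args) {
        // 100 handlers; proportions need not sum to exactly 1.
        ProportionalPermits p =
                new ProportionalPermits(100, Map.of("ns1", 0.5, "ns2", 0.75));
        System.out.println(p.available("ns1")); // 50
        System.out.println(p.available("ns2")); // 75
    }
}
```

Unlike the static controller, misconfigured ratios in this sketch cannot prevent the router from starting; they merely over- or under-subscribe permits.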



  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of handlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. *Extension ns is not supported*: During the running of the router, if a new 
ns is added to the cluster and a mount is added for the ns, but because no 
handler is allocated for the ns, the ns cannot be accessed through the router. 
We must reconfigure the number of handlers and then refresh the configuration. 
At this time, the router can access the ns normally. When we reconfigure the 
number of handlers, we have to face disadvantage 1: Configuration is 
inconvenient and error-prone.

3. *Cannot share handlers*:  The main purpose of proposing 
RouterRpcFairnessPolicyController is to enable the 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total and how many nameservices it currently serves, then 
carefully calculate how many handlers to allocate to each ns so that the sum 
across all ns does not exceed the router's total, while also considering how 
many handlers each ns needs for good performance. The configuration therefore 
demands great care: if I allocate even one handler too many to some ns, the 
total exceeds the number of handlers the router owns and the router fails to 
start. I then have to investigate why the router failed to start and, after 
finding the cause, reconsider the handler count for each ns.

2. *Extension ns is not supported*: During the running of the router, if a new 
ns is added to the cluster and a mount is added for the ns, but because no 
handler is allocated for the ns, the ns cannot be accessed through the router. 
We must reconfigure the number of handlers and then refresh the configuration. 
At this time, the router can access the ns normally. When we reconfigure the 
number of handlers, we have to face disadvantage 1: Configuration is 
inconvenient and error-prone.

3. *Cannot share handlers*: The main purpose of proposing 
RouterRpcFairnessPolicyController is to let the router keep serving ns with 
normal load without being affected by ns with higher load. First, not all ns 
have high load; second, a heavily loaded ns is not busy 24 hours a day. It may 
be busy only during certain periods, such as 0:00 to 8:00, and lightly loaded 
the rest of the time. Assume there are 2 ns, each allocated half of the 
handlers. If ns1 receives many requests from 0:00 to 14:00 and almost none 
from 14:00 to 24:00, while ns2 receives many requests from 12:00 to 24:00 and 
almost none from 0:00 to 12:00, then outside the 12:00-14:00 overlap only one 
ns is busy while the other is nearly idle, so half of the handlers are wasted.



  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of hadnlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. *Extension ns is not supported*: During the running of the router, if a new 
ns is added to the cluster and a mount is added for the ns, but because no 
handler is allocated for the ns, the ns cannot be accessed through the router. 
We must reconfigure the number of handlers and then refresh the configuration. 
At this time, the router can access the ns normally. When we reconfigure the 
number of handlers, we have to face disadvantage 1: Configuration is 
inconvenient and error-prone.

3. *Cannot share handlers*:  




> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800248#comment-17800248
 ] 

Ayush Saxena commented on HDFS-17299:
-

Yeps,

Excluding a rack in the streamer is quite tricky; we know neither the BPP nor 
the cluster rack configuration during the {{DataStreamer}} setup.

Maybe we should consider dropping the datanode from the pipeline, if we can't 
replace it, and reattempting with the remaining datanodes, similar to 
{{bestEffort}} in the normal {{DatanodeReplacement}} case after the stream has 
been created.

[https://github.com/apache/hadoop/blob/rel/release-2.10.2/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/ReplaceDatanodeOnFailure.java#L114-L125]

On the Namenode side, I don't think we have anything better than the stale-node 
mechanism, which only shortens the detection window rather than fixing the problem.

For the rest, I am also not very sure there is any other clean way to handle this.
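For reference, the {{bestEffort}} behavior mentioned in this comment is controlled by existing client-side settings; a minimal hdfs-site.xml sketch (these are the standard HDFS client property names, and per the discussion they only take effect after the pipeline has been set up):

```xml
<!-- Client-side settings controlling datanode replacement on pipeline failure.
     best-effort=true keeps writing with the surviving datanodes when a
     replacement cannot be found, instead of failing the write. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.best-effort</name>
  <value>true</value>
</property>
```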

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on 

[jira] [Commented] (HDFS-17254) DataNode httpServer has too many worker threads

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800247#comment-17800247
 ] 

ASF GitHub Bot commented on HDFS-17254:
---

xinglin commented on code in PR #6307:
URL: https://github.com/apache/hadoop/pull/6307#discussion_r1435954635


##
hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml:
##
@@ -154,6 +154,14 @@
   
 
 
+
+  dfs.datanode.netty.worker.threads
+  10

Review Comment:
   I don't have a strong opinion here. Either way works for me. 





> DataNode httpServer has too many worker threads
> ---
>
> Key: HDFS-17254
> URL: https://issues.apache.org/jira/browse/HDFS-17254
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Liangjun He
>Assignee: Liangjun He
>Priority: Minor
>  Labels: pull-request-available
>
> When optimizing the thread number of high-density storage DN, we found the 
> number of worker threads for the DataNode httpServer is twice the number of 
> available cores on node , resulting in too many threads. We can change this 
> to be configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800243#comment-17800243
 ] 

Xiaoqiao He commented on HDFS-17299:


Connection met some issues?

It seems we both hold the same opinion, but I don't have an idea for fixing it 
smoothly: the NameNode side does not recognise dead nodes/racks in time, and 
the client side does not know how many racks the cluster has. Any ideas?

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> 

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800241#comment-17800241
 ] 

Xiaoqiao He commented on HDFS-17299:


[~shahrs87] Please reference here: 
https://github.com/apache/hadoop/blob/415e9bdfbdeebded520e0233bcb91a487411a94b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1641
IMO, if there are only two racks in the cluster and one of them is out of 
service, the writer will always fail with the default configuration. I think we 
should fix this corner-case issue. Let's wait and see whether [~ayushtkn] has 
any other suggestions.

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
>   

[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800240#comment-17800240
 ] 

Ayush Saxena commented on HDFS-17299:
-

[~shahrs87] that config kicks in after pipeline setup, not while creating one. 
So, I think your failure is during create itself.

[https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1571-L1573]

 

It won't reach here in your case since the pipeline wasn't set up, so nodes 
will be null here.

[https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1455]

 

Which I feel is a bug, or at least warrants some improvement. :(

 

The end solution is to go ahead with 2 nodes in the pipeline. How to get there 
we can figure out; mostly it should be via ReplaceDatanodeOnFailure, but we can 
work that out.
 
[~hexiaoqiao] The default-BPP case would be 2 racks with one rack down, where 
the Namenode didn't recognise the rack as down during that period.
 
But the case mentioned here is for the rack-fault-tolerant BPP: 3 racks, 
replication factor 3, and 1 rack down that the NN doesn't recognise as dead, so 
it always tries to allocate a node from all 3 racks even though 1 rack is dead, 
and the create never succeeds. I have added a patch with a repro test; please 
take a look (a very quick patch, maybe wrong).
 
interesting problem :) 

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> 

[jira] [Updated] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-17299:

Attachment: repro.patch

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
> Attachments: repro.patch
>
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 600000 
> dfs.heartbeat.interval: 3 
> So it will take 1230000 ms (20.5 mins) to detect that the datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but still alive to namenode, 
> the client gets different datanodes but still all of them are in the same AZ. 
> See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=trueugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,369 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874-NN-IP-1594838129323:blk_1214652580_140946764
> 2023-12-16 17:17:44,454 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[AZ-2-dn-2:50010,DS-46bb45cc-af89-46f3-9f9d-24e4fdc35b6d,DISK]
> 2023-12-16 17:17:44,522 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652594_140946796, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,712 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,712 WARN  [Thread-39087] hdfs.DataStreamer - 

[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800238#comment-17800238
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

haiyang1987 commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435938037


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) 
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   Agree with @ayushtkn, the current behavior works as designed: if you don't 
specify any policy, the result is that all the enabled policies are validated.
   Maybe the current ticket should only fix the verifyClusterSetup output for 
an invalid param, thanks~
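The guard under review can be illustrated with a small self-contained sketch. Note this is illustrative only, not the actual `ECAdmin` code: the class name `ArgGuardSketch` and the simplified argument handling are assumptions. The point is the proposed behavior: without a `-policy` flag, any leftover arguments are rejected instead of being silently ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class ArgGuardSketch {
  // Simplified model of -verifyClusterSetup argument handling:
  // "-policy" consumes the remaining tokens as policy names; without
  // it, any extra tokens are an error (the proposed fix) instead of
  // being silently ignored.
  static int run(List<String> args) {
    List<String> policies = new ArrayList<>();
    if (!args.isEmpty() && "-policy".equals(args.get(0))) {
      args.remove(0);
      while (!args.isEmpty()) {
        policies.add(args.remove(0));
      }
    }
    if (!args.isEmpty()) {
      // Leftover tokens that no option consumed: fail fast.
      System.err.println("verifyClusterSetup: Too many arguments");
      return 1;
    }
    System.out.println("verifying: "
        + (policies.isEmpty() ? "<all enabled policies>" : policies));
    return 0;
  }

  public static void main(String[] argv) {
    // Mirrors "hdfs ec -verifyClusterSetup XOR-2-1-1024k" (no -policy
    // flag): with the guard, this fails fast instead of silently
    // verifying all enabled policies.
    run(new LinkedList<>(Arrays.asList("XOR-2-1-1024k")));
  }
}
```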





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy followed by the policy names; otherwise it 
> defaults to all enabled policies.
> In case there are additional invalid options, it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments error.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. *Configuration is inconvenient and error-prone*: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then how many nameservices the router currently has, 
and then carefully calculate how many handlers to allocate to each ns so that 
the sum of handlers across all ns does not exceed the router's total; I also 
need to consider how many handlers each ns should get to achieve good 
performance. Therefore, I need to be very careful when configuring. Even if I 
configure just one more handler for a certain ns, so that the total exceeds the 
number of handlers owned by the router, the router will fail to start. At that 
point, I have to investigate why the router failed to start, and after finding 
the reason, reconsider the number of handlers for each ns.

2. *Adding a new ns is not supported*: While the router is running, if a new ns 
is added to the cluster and a mount is added for it, the ns cannot be accessed 
through the router because no handlers are allocated for it. We must 
reconfigure the handler counts and then refresh the configuration; only then 
can the router access the ns normally. When we reconfigure the handler counts, 
we again face disadvantage 1: configuration is inconvenient and error-prone.

3. *Cannot share handlers*:  
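For context, a hedged sketch of what the static configuration looks like. The property names follow HDFS-14090, but the nameservice ids `ns1`/`ns2` and all the values here are made up for illustration. The per-ns counts must be hand-balanced against the router's total handler count, which is exactly the error-prone bookkeeping described in point 1:

```xml
<!-- hdfs-site.xml (router side); illustrative values only -->
<property>
  <name>dfs.federation.router.handler.count</name>
  <value>100</value>
</property>
<property>
  <name>dfs.federation.router.fairness.policy.controller.class</name>
  <value>org.apache.hadoop.hdfs.server.federation.fairness.StaticRouterRpcFairnessPolicyController</value>
</property>
<!-- These must be balanced by hand: if the per-ns counts exceed the
     total above, the router fails to start. -->
<property>
  <name>dfs.federation.router.fairness.handler.count.ns1</name>
  <value>60</value>
</property>
<property>
  <name>dfs.federation.router.fairness.handler.count.ns2</name>
  <value>30</value>
</property>
```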



  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. Configuration is inconvenient and error-prone: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of handlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. Extension ns is not supported: During the running of the router, if a new ns 
is added to the cluster and a mount is added for the ns, but because no handler 
is allocated for the ns, the ns cannot be accessed through the router. We must 
reconfigure the number of handlers and then refresh the configuration. At this 
time, the router can access the ns normally. When we reconfigure the number of 
handlers, we have to face disadvantage 1: Configuration is inconvenient and 
error-prone.

3. Cannot share handlers: 




> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:

1. Configuration is inconvenient and error-prone: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of handlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. Extension ns is not supported: During the running of the router, if a new ns 
is added to the cluster and a mount is added for the ns, but because no handler 
is allocated for the ns, the ns cannot be accessed through the router. We must 
reconfigure the number of handlers and then refresh the configuration. At this 
time, the router can access the ns normally. When we reconfigure the number of 
handlers, we have to face disadvantage 1: Configuration is inconvenient and 
error-prone.

3. Cannot share handlers: 



  was:
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:
1. Configuration is inconvenient and error-prone: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of handlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. Extension ns is not supported: During the running of the router, if a new ns 
is added to the cluster and a mount is added for the ns, but because no handler 
is allocated for the ns, the ns cannot be accessed through the router. We must 
reconfigure the number of handlers and then refresh the configuration. At this 
time, the router can access the ns normally. When we reconfigure the number of 
handlers, we have to face disadvantage 1: Configuration is inconvenient and 
error-prone.

3. Cannot share handlers: 




> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> 

[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load. But the 
StaticRouterRpcFairnessPolicyController still falls short in many ways, such as:
1. Configuration is inconvenient and error-prone: When I use 
StaticRouterRpcFairnessPolicyController, I first need to know how many handlers 
the router has in total, then I have to know how many nameservices the router 
currently has, and then carefully calculate how many handlers to allocate to 
each ns so that the sum of handlers for all ns will not exceed the total 
handlers of the router, and I also need to consider how many handlers to 
allocate to each ns to achieve better performance. Therefore, I need to be very 
careful when configuring. Even if I configure only one more handler for a 
certain ns, the total number is more than the number of handlers owned by the 
router, which will also cause the router to fail to start. At this time, I had 
to investigate the reason why the router failed to start. After finding the 
reason, I had to reconsider the number of handlers for each ns.

2. Extension ns is not supported: During the running of the router, if a new ns 
is added to the cluster and a mount is added for the ns, but because no handler 
is allocated for the ns, the ns cannot be accessed through the router. We must 
reconfigure the number of handlers and then refresh the configuration. At this 
time, the router can access the ns normally. When we reconfigure the number of 
handlers, we have to face disadvantage 1: Configuration is inconvenient and 
error-prone.

3. Cannot share handlers: 



  was:[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] which 
provides a StaticRouterRpcFairnessPolicyController to support configuring 
different handlers for different ns. Using the 
StaticRouterRpcFairnessPolicyController allows the router to isolate different 
ns, and the ns with a higher load will not affect the router's access to the ns 
with a normal load.


> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load. But the 
> StaticRouterRpcFairnessPolicyController still falls short in many ways, such 
> as:
> 1. Configuration is inconvenient and error-prone: When I use 
> StaticRouterRpcFairnessPolicyController, I first need to know how many 
> handlers the router has in total, then I have to know how many nameservices 
> the router currently has, and then carefully calculate how many handlers to 
> allocate to each ns so that the sum of handlers for all ns will not exceed 
> the total handlers of the router, and I also need to consider how many 
> handlers to allocate to each ns to achieve better performance. Therefore, I 
> need to be very careful when configuring. Even if I configure only one more 
> handler for a certain ns, the total number is more than the number of 
> handlers owned by the router, which will also cause the router to fail to 
> start. At this time, I had to investigate the reason why the router failed to 
> start. After finding the reason, I had to reconsider the number of handlers 
> for each ns.
> 2. Extension ns is not supported: During the running of the router, if a new 
> ns is added to the cluster and a mount is added for the ns, but because no 
> handler is allocated for the ns, the ns cannot be accessed through the 
> router. We must reconfigure the number of handlers and then refresh the 
> configuration. At this time, the router can access the ns normally. When we 
> reconfigure the number of handlers, we have to face disadvantage 1: 
> Configuration is inconvenient and error-prone.
> 
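The proposed alternative can be sketched as follows. This is a hedged illustration of the idea, not the actual ProportionRouterRpcFairnessPolicyController code; the method names, nameservice ids, and ratios are all made up. Each ns gets a proportion of the total handlers rather than an absolute count, so nothing needs to be balanced by hand, a new ns only needs a ratio, and ratios summing above 1.0 effectively share handlers:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

public class ProportionSketch {
  // Turn per-ns proportions into semaphore permits. The proportions
  // need not sum to 1.0: a sum above 1.0 over-subscribes the handlers
  // (sharing), a sum below 1.0 leaves headroom, and a newly added ns
  // just needs one ratio instead of a full rebalance.
  static Map<String, Integer> allocate(int totalHandlers,
                                       Map<String, Double> proportions) {
    Map<String, Integer> permits = new LinkedHashMap<>();
    proportions.forEach((ns, ratio) ->
        permits.put(ns, Math.max(1, (int) (totalHandlers * ratio))));
    return permits;
  }

  public static void main(String[] args) {
    Map<String, Double> proportions = new LinkedHashMap<>();
    proportions.put("ns1", 0.6);
    proportions.put("ns2", 0.3);
    proportions.put("ns3", 0.2);  // sum > 1.0: handlers are shared

    allocate(100, proportions).forEach((ns, n) -> {
      Semaphore s = new Semaphore(n);  // per-ns RPC admission gate
      System.out.println(ns + " -> " + s.availablePermits() + " permits");
    });
  }
}
```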

[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800236#comment-17800236
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435934977


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) 
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   Thx, I understand. It is necessary to modify the protocol in order to 
support policy-level validation.





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy followed by the policy names; otherwise it 
> defaults to all enabled policies.
> In case there are additional invalid options, it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments error.






[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800235#comment-17800235
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

ayushtkn commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435934137


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) 
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   If you pass multiple policies, that means you want a combined result: whether 
all of them are supported in the "cluster" or not. If you want one policy, pass 
one policy only.
   
   The whole design is to verify things at the cluster level, not at the policy 
level: to highlight cluster-level setup issues, like all the enabled policies 
not being supported, and things like that. It was created for cluster-admin 
usage.
   
   You could add an additional option that reports the result per policy; in 
that case, after getting the combined result, it can loop over the policies and 
get an individual result for each. Changing the proto & all is the last thing 
to do, but I don't think it is required as of now.





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy followed by the policy names; otherwise it 
> defaults to all enabled policies.
> In case there are additional invalid options, it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments error.






[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] 
: Improved isolation for downstream name nodes, which provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load.

> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] : Improved 
> isolation for downstream name nodes, which provides a 
> StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load.






[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Description: [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] 
which provides a StaticRouterRpcFairnessPolicyController to support configuring 
different handlers for different ns. Using the 
StaticRouterRpcFairnessPolicyController allows the router to isolate different 
ns, and the ns with a higher load will not affect the router's access to the ns 
with a normal load.  (was: 
[HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] : Improved 
isolation for downstream name nodes, which provides a 
StaticRouterRpcFairnessPolicyController to support configuring different 
handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
allows the router to isolate different ns, and the ns with a higher load will 
not affect the router's access to the ns with a normal load.)

> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>
> [HDFS-14090|https://issues.apache.org/jira/browse/HDFS-14090] which provides 
> a StaticRouterRpcFairnessPolicyController to support configuring different 
> handlers for different ns. Using the StaticRouterRpcFairnessPolicyController 
> allows the router to isolate different ns, and the ns with a higher load will 
> not affect the router's access to the ns with a normal load.






[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Attachment: HDFS-17302.002.patch

> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch, HDFS-17302.002.patch
>
>







[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Zhang updated HDFS-17302:
--
Attachment: HDFS-17302.001.patch
Status: Patch Available  (was: Open)

> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDFS-17302.001.patch
>
>







[jira] [Commented] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800231#comment-17800231
 ] 

ASF GitHub Bot commented on HDFS-17302:
---

KeeProMise opened a new pull request, #6380:
URL: https://github.com/apache/hadoop/pull/6380

   
   
   ### Description of PR
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>







[jira] [Updated] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17302:
--
Labels: pull-request-available  (was: )

> RBF: ProportionRouterRpcFairnessPolicyController-support proportional 
> allocation of semaphores
> --
>
> Key: HDFS-17302
> URL: https://issues.apache.org/jira/browse/HDFS-17302
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: rbf
>Reporter: Jian Zhang
>Assignee: Jian Zhang
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (HDFS-17302) RBF: ProportionRouterRpcFairnessPolicyController-support proportional allocation of semaphores

2023-12-24 Thread Jian Zhang (Jira)
Jian Zhang created HDFS-17302:
-

 Summary: RBF: ProportionRouterRpcFairnessPolicyController-support 
proportional allocation of semaphores
 Key: HDFS-17302
 URL: https://issues.apache.org/jira/browse/HDFS-17302
 Project: Hadoop HDFS
  Issue Type: New Feature
  Components: rbf
Reporter: Jian Zhang
Assignee: Jian Zhang









[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800223#comment-17800223
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435902418


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) 
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   @ayushtkn Thanks for your question; this is the design of the API itself.
   However, the combined message cannot clearly explain why each individual EC 
policy is not supported.





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy then the name of policies, else it 
> defaults to all enabled policies.
> In case there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17299) HDFS is not rack failure tolerant while creating a new file.

2023-12-24 Thread Rushabh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800217#comment-17800217
 ] 

Rushabh Shah commented on HDFS-17299:
-

> Maybe if you would have put 
> dfs.client.block.write.replace-datanode-on-failure.enable as false, it 
> wouldn't have tried to replace the DN itself & went ahead with 2 DN from 
> other AZ?

It is entirely possible that I am not reading the code right. I am a little bit 
out of sync with the DataStreamer codebase.
But I don't see this config property 
dfs.client.block.write.replace-datanode-on-failure.enable being used anywhere 
in the PIPELINE_SETUP_CREATE phase.
I am looking at the branch-2.10 branch. This is the code flow. 
[DataStreamer#run()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L708-L711]
 --> 
[nextBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1655]
 --> 
[createBlockOutputStream()|https://github.com/apache/hadoop/blob/branch-2.10/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1751]
 

There is a retry within nextBlockOutputStream via 
dfs.client.block.write.retries, but it doesn't take 
dfs.client.block.write.replace-datanode-on-failure.enable into consideration.
Cc [~ayushtkn] [~hexiaoqiao]

> HDFS is not rack failure tolerant while creating a new file.
> 
>
> Key: HDFS-17299
> URL: https://issues.apache.org/jira/browse/HDFS-17299
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1
>Reporter: Rushabh Shah
>Priority: Critical
>
> Recently we saw an HBase cluster outage when we mistakenly brought down 1 AZ.
> Our configuration:
> 1. We use 3 Availability Zones (AZs) for fault tolerance.
> 2. We use BlockPlacementPolicyRackFaultTolerant as the block placement policy.
> 3. We use the following configuration parameters: 
> dfs.namenode.heartbeat.recheck-interval: 60 
> dfs.heartbeat.interval: 3 
> So it will take 123 ms (20.5mins) to detect that datanode is dead.
>  
> Steps to reproduce:
>  # Bring down 1 AZ.
>  # HBase (HDFS client) tries to create a file (WAL file) and then calls 
> hflush on the newly created file.
>  # DataStreamer is not able to find blocks locations that satisfies the rack 
> placement policy (one copy in each rack which essentially means one copy in 
> each AZ)
>  # Since all the datanodes in that AZ are down but the namenode still 
> considers them alive, the client gets different datanodes, but all of them 
> are still in the same AZ. See logs below.
>  # HBase is not able to create a WAL file and it aborts the region server.
>  
> Relevant logs from hdfs client and namenode
>  
> {noformat}
> 2023-12-16 17:17:43,818 INFO  [on default port 9000] FSNamesystem.audit - 
> allowed=true ugi=hbase/ (auth:KERBEROS) ip=  
> cmd=create  src=/hbase/WALs/  dst=null
> 2023-12-16 17:17:43,978 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652565_140946716, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,061 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1651)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:715)
> 2023-12-16 17:17:44,061 WARN  [Thread-39087] hdfs.DataStreamer - Abandoning 
> BP-179318874--1594838129323:blk_1214652565_140946716
> 2023-12-16 17:17:44,179 WARN  [Thread-39087] hdfs.DataStreamer - Excluding 
> datanode 
> DatanodeInfoWithStorage[:50010,DS-a493abdb-3ac3-49b1-9bfb-848baf5c1c2c,DISK]
> 2023-12-16 17:17:44,339 INFO  [on default port 9000] hdfs.StateChange - 
> BLOCK* allocate blk_1214652580_140946764, replicas=:50010, 
> :50010, :50010 for /hbase/WALs/
> 2023-12-16 17:17:44,369 INFO  [Thread-39087] hdfs.DataStreamer - Exception in 
> createBlockOutputStream
> java.io.IOException: Got error, status=ERROR, status message , ack with 
> firstBadLink as :50010
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:113)
> at 
> org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1747)
> at 
> 
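For reference, the ~20.5-minute detection delay discussed in HDFS-17299 above follows HDFS's standard heartbeat-expiry rule (2 × recheck interval + 10 × heartbeat interval, as implemented in DatanodeManager). A minimal sketch, assuming the configuration values from the issue (600000 ms recheck interval, 3 s heartbeat interval) and a hypothetical helper name:

```python
# Sketch of the NameNode's dead-node detection delay. The rule
# 2 * recheck-interval + 10 * heartbeat-interval mirrors
# DatanodeManager's heartbeatExpireInterval; the helper name is ours.
def heartbeat_expire_interval_ms(recheck_interval_ms, heartbeat_interval_s):
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_s * 1000

interval_ms = heartbeat_expire_interval_ms(600_000, 3)
print(interval_ms, interval_ms / 60_000)  # 1230000 ms -> 20.5 minutes
```

With the default settings quoted above, every datanode in the downed AZ stays "alive" to the namenode for the full 20.5 minutes, which is why the client keeps being offered nodes from the dead AZ.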

[jira] [Commented] (HDFS-17301) Add read and write dataXceiver threads count metrics to datanode.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800201#comment-17800201
 ] 

ASF GitHub Bot commented on HDFS-17301:
---

hadoop-yetus commented on PR #6377:
URL: https://github.com/apache/hadoop/pull/6377#issuecomment-1868572225

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  1s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  13m 47s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  35m 29s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  18m  9s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  16m 28s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   4m 38s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 12s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 28s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 40s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m  1s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  40m 34s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m  6s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  16m 26s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 34s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m  8s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 23s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 32s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m 20s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 30s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  19m 11s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 251m  1s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  5s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 513m  4s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6377/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6377 |
   | Optional Tests | dupname asflicense mvnsite codespell detsecrets 
markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs 
checkstyle |
   | uname | Linux d3840b787b54 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / ecb37461d2ce1ebe233aa07f552d40c1cf9a1e42 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6377/4/testReport/ |
   | Max. process+thread count | 2804 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | 

[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800200#comment-17800200
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

ayushtkn commented on code in PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#discussion_r1435864191


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/ECAdmin.java:
##
@@ -642,6 +642,10 @@ public int run(Configuration conf, List<String> args) 
throws IOException {
   throw e;
 }
   } else {
+if (args.size() > 0) {
+  System.err.println(getName() + ": Too many arguments");

Review Comment:
   It works as designed: if you don't specify any policy, the result says 
whether all the enabled policies can work or not, and the minimum number of 
datanodes required for each of them.
   
   It is validating the cluster setup for the enabled policies, not a 
per-policy setup.
   
   If you specify multiple policies, it provides the aggregate result, i.e. 
whether all those policies can work on the cluster or not, and if not, a 
combined message saying how many DNs are required,
   like
   ```
   9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
XOR-2-1-1024k. The number of DataNodes is only 3. 
   ```
   
   Like here, 9 is the required number because of 6-3.
   
   If we intend to improve the behaviour, we can think about whether there is a 
need.





> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy then the name of policies, else it 
> defaults to all enabled policies.
> In case there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800196#comment-17800196
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

hfutatzhanghb commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1435849521


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java:
##
@@ -536,8 +536,13 @@ protected void adjustChunkBoundary() {
 }
 
 if (!getStreamer().getAppendChunk()) {
-  final int psize = (int) Math
-  .min(blockSize - getStreamer().getBytesCurBlock(), writePacketSize);
+  int psize = 0;
+  if (blockSize == getStreamer().getBytesCurBlock()) {

Review Comment:
   @Hexiaoqiao Sir, thanks for your reply. I will add unit tests soon when I am 
available.





> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In method computePacketChunkSize, the parameters psize and csize would be 
> (0, 512)
> when writing to a new block. It would be better to use writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17273) Improve local variables duration of DataStreamer for better debugging.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800182#comment-17800182
 ] 

ASF GitHub Bot commented on HDFS-17273:
---

hfutatzhanghb commented on PR #6321:
URL: https://github.com/apache/hadoop/pull/6321#issuecomment-1868518216

   > Committed to trunk. Thanks @hfutatzhanghb , @haiyang1987 and @tomscut .
   
   @Hexiaoqiao Sir, thanks a lot for merging and reviewing. Thanks 
@haiyang1987 @tomscut for reviewing carefully.  




>  Improve local variables duration of DataStreamer for better debugging.
> ---
>
> Key: HDFS-17273
> URL: https://issues.apache.org/jira/browse/HDFS-17273
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800168#comment-17800168
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#issuecomment-1868495316

   > It may be necessary to modify the erasurecoding.proto file, changing 
`required` to 
`repeated`, similar to `DistributedFileSystem#addErasureCodingPolicies`.
   > 
   > This modification will result in a many-to-many relationship.
   > 
   > ```proto
   > // now
   > message GetECTopologyResultForPoliciesResponseProto {
   >   required ECTopologyVerifierResultProto response = 1;
   > }
   > 
   > // expect
   > message GetECTopologyResultForPoliciesResponseProto {
   >   repeated ECTopologyVerifierResultProto response = 1;
   > }
   > ```
   
   Hi @ayushtkn @haiyang1987, If you have time, let's discuss it together, 
Thanks.




> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy then the name of policies, else it 
> defaults to all enabled policies.
> In case there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800165#comment-17800165
 ] 

ASF GitHub Bot commented on HDFS-17056:
---

huangzhaobo99 commented on PR #6379:
URL: https://github.com/apache/hadoop/pull/6379#issuecomment-1868494355

   It may be necessary to modify the erasurecoding.proto file, changing 
`required` to `repeated`, 
   similar to `DistributedFileSystem#addErasureCodingPolicies`. 
   
   This modification will result in a many-to-many relationship.
   ```proto
   // now
   message GetECTopologyResultForPoliciesResponseProto {
 required ECTopologyVerifierResultProto response = 1;
   }
   
   // expect
   message GetECTopologyResultForPoliciesResponseProto {
 repeated ECTopologyVerifierResultProto response = 1;
   }
   ```




> EC: Fix verifyClusterSetup output in case of an invalid param.
> --
>
> Key: HDFS-17056
> URL: https://issues.apache.org/jira/browse/HDFS-17056
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Ayush Saxena
>Assignee: huangzhaobo99
>Priority: Major
>  Labels: newbie, pull-request-available
>
> {code:java}
> bin/hdfs ec  -verifyClusterSetup XOR-2-1-1024k        
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, 
> XOR-2-1-1024k. The number of DataNodes is only 3. {code}
> verifyClusterSetup requires -policy then the name of policies, else it 
> defaults to all enabled policies.
> In case there are additional invalid options it silently ignores them, unlike 
> other EC commands, which throw a Too Many Arguments exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17301) Add read and write dataXceiver threads count metrics to datanode.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800154#comment-17800154
 ] 

ASF GitHub Bot commented on HDFS-17301:
---

hadoop-yetus commented on PR #6377:
URL: https://github.com/apache/hadoop/pull/6377#issuecomment-1868473774

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  11m 40s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  markdownlint  |   0m  0s |  |  markdownlint was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  14m 19s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 57s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |  17m 10s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |  15m 58s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   5m  0s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   3m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   2m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m 28s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 28s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  17m 40s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |  17m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |  16m 51s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |  16m 51s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   4m 39s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   3m 10s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   2m 16s |  |  the patch passed with JDK 
Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   2m 34s |  |  the patch passed with JDK 
Private Build-1.8.0_392-8u392-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   6m 47s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m 26s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  19m 44s |  |  hadoop-common in the patch 
passed.  |
   | +1 :green_heart: |  unit  | 216m 36s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   1m  6s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 485m 45s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6377/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6377 |
   | Optional Tests | dupname asflicense mvnsite codespell detsecrets 
markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs 
checkstyle |
   | uname | Linux e59912d06f26 5.15.0-88-generic #98-Ubuntu SMP Mon Oct 2 
15:18:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / e65f7a52ff6f6013bfaebe0f6dcb04bd13c78d30 |
   | Default Java | Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.21+9-post-Ubuntu-0ubuntu120.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_392-8u392-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6377/3/testReport/ |
   | Max. process+thread count | 3699 (vs. ulimit of 5500) |
   | modules | C: hadoop-common-project/hadoop-common 
hadoop-hdfs-project/hadoop-hdfs U: . |
   | Console output | 

[jira] [Commented] (HDFS-17293) First packet data + checksum size will be set to 516 bytes when writing to a new block.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800151#comment-17800151
 ] 

ASF GitHub Bot commented on HDFS-17293:
---

Hexiaoqiao commented on code in PR #6368:
URL: https://github.com/apache/hadoop/pull/6368#discussion_r1435791414


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java:
##
@@ -536,8 +536,13 @@ protected void adjustChunkBoundary() {
 }
 
 if (!getStreamer().getAppendChunk()) {
-  final int psize = (int) Math
-  .min(blockSize - getStreamer().getBytesCurBlock(), writePacketSize);
+  int psize = 0;
+  if (blockSize == getStreamer().getBytesCurBlock()) {

Review Comment:
   I haven't thought it through carefully, but at first glance it seems like a 
good improvement. cc @zhangshuyan0 Would you mind taking another look?
   @hfutatzhanghb it would be better to add unit tests to cover this case. 
Thanks.





> First packet data + checksum size will be set to 516 bytes when writing to a 
> new block.
> ---
>
> Key: HDFS-17293
> URL: https://issues.apache.org/jira/browse/HDFS-17293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.6
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>
> First packet size will be set to 516 bytes when writing to a new block.
> In method computePacketChunkSize, the parameters psize and csize would be 
> (0, 512)
> when writing to a new block. It would be better to use writePacketSize.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17280) Pipeline recovery should better end block in advance when bytes acked greater than half of blocksize.

2023-12-24 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17800150#comment-17800150
 ] 

ASF GitHub Bot commented on HDFS-17280:
---

Hexiaoqiao commented on PR #6336:
URL: https://github.com/apache/hadoop/pull/6336#issuecomment-1868468065

   @hfutatzhanghb Thanks for your contribution! Sorry, I didn't fully 
understand this proposal. Would you mind offering some more information about 
what issue you met and what this PR does? Thanks again.




> Pipeline recovery should better end block in advance when bytes acked greater 
> than half of blocksize.
> -
>
> Key: HDFS-17280
> URL: https://issues.apache.org/jira/browse/HDFS-17280
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: farmmamba
>Assignee: farmmamba
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org