[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688874#comment-17688874
 ] 

ASF GitHub Bot commented on HDFS-16922:
---

hfutatzhanghb commented on PR #5398:
URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1430888618

   > Requires a UT which can reproduce the said issue
   
   Hi @ayushtkn, I have updated the description of this issue. Please take a 
look~ Thanks a lot.




> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Resolved] (HDFS-16921) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhangHB resolved HDFS-16921.

Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16921
> URL: https://issues.apache.org/jira/browse/HDFS-16921
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Resolved] (HDFS-16920) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhangHB resolved HDFS-16920.

Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16920
> URL: https://issues.apache.org/jira/browse/HDFS-16920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Resolved] (HDFS-16919) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhangHB resolved HDFS-16919.

Resolution: Duplicate

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16919
> URL: https://issues.apache.org/jira/browse/HDFS-16919
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Critical
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688863#comment-17688863
 ] 

ASF GitHub Bot commented on HDFS-16922:
---

hfutatzhanghb commented on PR #5398:
URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1430844228

   > 
   
   Hello @ayushtkn. OK, I will try to construct a UT to reproduce this issue, 
and I will try to describe the issue on this page.




> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688861#comment-17688861
 ] 

ASF GitHub Bot commented on HDFS-16922:
---

hfutatzhanghb opened a new pull request, #5398:
URL: https://github.com/apache/hadoop/pull/5398

   The current logic of the IncrementalBlockReportManager#addRDBI method could 
lead to missing blocks when datanodes in the pipeline are I/O busy.




> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: ZhangHB
>Priority: Major
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Updated] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16922:
--
Labels: pull-request-available  (was: )

> The logic of IncrementalBlockReportManager#addRDBI method may cause missing 
> blocks when cluster is busy.
> 
>
> Key: HDFS-16922
> URL: https://issues.apache.org/jira/browse/HDFS-16922
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current logic of the IncrementalBlockReportManager#addRDBI method could 
> lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688859#comment-17688859
 ] 

ASF GitHub Bot commented on HDFS-16914:
---

tomscut commented on code in PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106710999


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block,
 }
 // Ensure we record the new generation stamp
 getEditLog().logSync();
+LOG.info("bumpBlockGenerationStamp({}, client={}) success",
+locatedBlock.getBlock(), clientName);

Review Comment:
   > @tomscut We record block information; will there be a lot of logs? Should 
it be changed to debug?
   
   I was worried about this before, but through offline discussions and testing 
with @hfutatzhanghb, we found that the logs were not too frequent.
   





> Add some logs for updateBlockForPipeline RPC.
> -
>
> Key: HDFS-16914
> URL: https://issues.apache.org/jira/browse/HDFS-16914
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Minor
>  Labels: pull-request-available
>
> Recently, we received a phone alarm about missing blocks. We found logs like 
> the ones below in one datanode where the block was placed:
>  
> {code:java}
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src: 
> /clientAddress:44638 dest: /localAddress:50010 of size 45733720
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src: 
> /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720 {code}
> The datanode received the same block with different generation stamps because 
> of a socket timeout exception. blk_1305044966_231826462 was received from the 
> upstream datanode in the pipeline, which has two datanodes. 
> blk_1305044966_231832415 was received from the client directly.
>  
> We have searched all log info about blk_1305044966 in the namenode and the 
> three datanodes in the original pipeline, but we could not obtain any helpful 
> message about the generation stamp 231826462. After diving into the source 
> code, it was assigned in NameNodeRpcServer#updateBlockForPipeline, which is 
> invoked in DataStreamer#setupPipelineInternal. The updateBlockForPipeline RPC 
> does not have any log info, so I think we should add some logs in this RPC.
>  
>  






[jira] [Created] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)
ZhangHB created HDFS-16922:
--

 Summary: The logic of IncrementalBlockReportManager#addRDBI method 
may cause missing blocks when cluster is busy.
 Key: HDFS-16922
 URL: https://issues.apache.org/jira/browse/HDFS-16922
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: ZhangHB


The current logic of the IncrementalBlockReportManager#addRDBI method could 
lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Created] (HDFS-16921) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)
ZhangHB created HDFS-16921:
--

 Summary: The logic of IncrementalBlockReportManager#addRDBI method 
may cause missing blocks when cluster is busy.
 Key: HDFS-16921
 URL: https://issues.apache.org/jira/browse/HDFS-16921
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB


The current logic of the IncrementalBlockReportManager#addRDBI method could 
lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Created] (HDFS-16920) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)
ZhangHB created HDFS-16920:
--

 Summary: The logic of IncrementalBlockReportManager#addRDBI method 
may cause missing blocks when cluster is busy.
 Key: HDFS-16920
 URL: https://issues.apache.org/jira/browse/HDFS-16920
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB


The current logic of the IncrementalBlockReportManager#addRDBI method could 
lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Created] (HDFS-16919) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.

2023-02-14 Thread ZhangHB (Jira)
ZhangHB created HDFS-16919:
--

 Summary: The logic of IncrementalBlockReportManager#addRDBI method 
may cause missing blocks when cluster is busy.
 Key: HDFS-16919
 URL: https://issues.apache.org/jira/browse/HDFS-16919
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Affects Versions: 3.3.4
Reporter: ZhangHB


The current logic of the IncrementalBlockReportManager#addRDBI method could 
lead to missing blocks when datanodes in the pipeline are I/O busy.






[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688857#comment-17688857
 ] 

ASF GitHub Bot commented on HDFS-16918:
---

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430828452

   > If the datanode is connected to observer namenode, it can serve requests, 
why we need to shutdown
   
   The observer namenode is a different case. I was actually thinking about 
making this include the observer namenode too, i.e. if the datanode has not 
received a heartbeat from the observer or active namenode in the last, say, 30s 
or so, then it should shut down. This is an option, no issues with it.
   
   
   > Even if it is connected to standby, a failover happens and it will be in 
good shape, else if you restart a bunch of datanodes, the new namenode will be 
flooded by block reports and just increasing problems.
   
   This problem would occur only if we select an unreasonably low number. The 
recommended value for this config is high enough to include extra time for a 
namenode failover.
   
   
   > If something gets messed up with Active namenode, you shutdown all, the BR 
are already heavy, you forced all other namenodes to handle them again, making 
failover more difficult. and if it is some faulty datanodes which lost 
connection, you didn't get that alarmed, and all Standby and Observers will 
keep on getting flooded by BRs, so in case Active NN literally dies and tries 
to failover to any of the Namenode which these Datanodes were connected, will 
be fed with unnecessary loads of BlockReports. (BR has an option of initial 
delay as well, it isn't like all bombard at once and you are sorted in 5-10 
mins)
   
   The moment the active namenode becomes messy, or dies, is exactly what can 
impact the availability of the HDFS service. So either the observer namenode 
takes care of read requests in the meantime, or the failover needs to happen. 
If neither of those happens, it is the datanode that is not really useful by 
staying in the cluster for a longer duration. Let's say the namenode goes bad 
and failover does take time; the new active one is anyway going to take time 
processing BRs, right?
   
   
   > If something got messed with the datanode, that is why it isn't able to 
connect to Active. If something is in Memory not persisted to disk, or some JMX 
parameter or N/W parameters which can be used to figure out things gets lost.
   
   Do you mean the hsync vs hflush kind of thing for in-progress files? Is that 
not already taken care of?
   
   
   > That is the reason most cluster administrator in not so cool situations, 
show XYZ datanode is unhealthy or not, if in some case they don't it should be 
handled over there.
   
   The response from the cluster admin applications would take time. Why not 
get auto-healed by the datanode? Also, it's not that this change is going to 
terminate the datanode; it is going to shut down properly.
   
   
   > In case of shared datanodes in a federated setup, say it is connected to 
Active for one Namespace and has completely lost touch with another, then? 
Restart to get both working? Don't restart so that at least one stays working? 
Both are correct in there own ways and situation and the datanode shouldn't be 
in a state to decide its fate for such reasons.
   
   IMO any namespace that is not connected to the active namenode is not up for 
serving requests from the active namenode and hence is not in a good state. I 
got your point, but shouldn't the health of a datanode in a federated setup be 
determined by whether all BPs are connected to the active? Is that not the real 
factor determining the health of the datanode?
   
   
   > Making anything configurable doesn't justify having it in. if we are 
letting any user to use this via any config as well, then we should be sure 
enough it is necessary and good thing to do, we can not say ohh you configured 
it, now it is your problem...
   
   I am not making the claim only based on this being a configurable feature, 
but it is reasonable enough to determine the best course of action for a given 
situation. The only recommendation I have is: the user should be able to let 
the datanode decide whether it should shut down gracefully when it has not 
heard anything from the active or observer namenode for the past x sec (50/60s 
or so).
   I have tried my best to answer the above questions. Please also take a look 
at the Jira/PR description where this idea has been taken from. We have seen 
issues with specific infra, and until manually shutting down datanodes, we did 
not see any hope for improving availability; this has happened multiple times.
   
   Please keep in mind that cluster administrators in cloud native env do not 
have access to JMX metrics due to the security constraints.
   
   Really appreciate all your points and suggestions Ayush, please take a look 
again.




> 

[jira] [Updated] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16917:
--
Labels: pull-request-available  (was: )

> Add transfer rate quantile metrics for DataNode reads
> -
>
> Key: HDFS-16917
> URL: https://issues.apache.org/jira/browse/HDFS-16917
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: datanode
>Reporter: Ravindra Dingankar
>Priority: Minor
>  Labels: pull-request-available
>
> Currently we have the following metrics for datanode reads.
> |BytesRead|Total number of bytes read from DataNode|
> |BlocksRead|Total number of blocks read from DataNode|
> |TotalReadTime|Total number of milliseconds spent on read operation|
> We would like to add a new quantile metric calculating the distribution of 
> data transfer rate for datanode reads.
>  






[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688856#comment-17688856
 ] 

ASF GitHub Bot commented on HDFS-16917:
---

rdingankar opened a new pull request, #5397:
URL: https://github.com/apache/hadoop/pull/5397

   
   
   ### Description of PR
   The transfer rate metric for datanode reads will be calculated as the rate 
at which bytes are read (bytes per ms). With quantiles we will get a 
distribution of this rate, which will be helpful in identifying slow datanodes.
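   As a rough sketch of how such a rate could be computed and fed into a 
quantile metric (this is not the actual patch; it assumes the existing metrics2 
MutableQuantiles helper, and the metric name "readTransferRate60s" is invented):
   
   ```java
   import org.apache.hadoop.metrics2.lib.MetricsRegistry;
   import org.apache.hadoop.metrics2.lib.MutableQuantiles;
   
   public class ReadTransferRateSketch {
     private final MetricsRegistry registry = new MetricsRegistry("datanode");
     // Quantile estimator with a 60-second rollover window (interval is in seconds).
     private final MutableQuantiles readTransferRate = registry.newQuantiles(
         "readTransferRate60s", "Read transfer rate", "ops", "rate", 60);
   
     /** Record one completed block read as bytes per millisecond. */
     public void addReadTransferRate(long bytesRead, long durationMs) {
       // Guard against a zero-length interval to avoid division by zero.
       readTransferRate.add(bytesRead / Math.max(durationMs, 1));
     }
   }
   ```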
   
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ Y] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ NA] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ NA] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ NA] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> Add transfer rate quantile metrics for DataNode reads
> -
>
> Key: HDFS-16917
> URL: https://issues.apache.org/jira/browse/HDFS-16917
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: datanode
>Reporter: Ravindra Dingankar
>Priority: Minor
>
> Currently we have the following metrics for datanode reads.
> |BytesRead|Total number of bytes read from DataNode|
> |BlocksRead|Total number of blocks read from DataNode|
> |TotalReadTime|Total number of milliseconds spent on read operation|
> We would like to add a new quantile metric calculating the distribution of 
> data transfer rate for datanode reads.
>  






[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688840#comment-17688840
 ] 

ASF GitHub Bot commented on HDFS-16918:
---

ayushtkn commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430800093

   By Admin I mean cluster administrator services; they can keep track of 
datanodes and decide what needs to be done to the datanode.
   If those services can trigger a restart when the datanode is shut down, they 
can track in which situations the datanode needs to be restarted.
   
   Not checking the code, but comments:
   
   - If the datanode is connected to the observer namenode, it can serve 
requests, so why do we need to shut down?
   - Even if it is connected to the standby, a failover happens and it will be 
in good shape; otherwise, if you restart a bunch of datanodes, the new namenode 
will be flooded by block reports, just increasing problems.
   - If something gets messed up with the active namenode and you shut them all 
down, the BRs are already heavy; you force all other namenodes to handle them 
again, making failover more difficult. And if it is some faulty datanodes which 
lost connection, you don't get alarmed about that, and all standby and observer 
namenodes will keep on getting flooded by BRs, so in case the active NN 
literally dies and tries to fail over to any of the namenodes these datanodes 
were connected to, it will be fed unnecessary loads of BlockReports. (BR has an 
option of initial delay as well; it isn't like all bombard at once and you are 
sorted in 5-10 mins.)
   - If something got messed up with the datanode, that may be why it isn't 
able to connect to the active. If something is in memory and not persisted to 
disk, or some JMX or N/W parameters which could be used to figure things out, 
it gets lost.
   - That is the reason most cluster administrators, in not-so-cool situations, 
show whether XYZ datanode is unhealthy or not; if in some case they don't, it 
should be handled over there.
   - In the case of shared datanodes in a federated setup, say it is connected 
to the active for one namespace and has completely lost touch with another, 
then what? Restart to get both working? Don't restart so that at least one 
stays working? Both are correct in their own ways and situations, and the 
datanode shouldn't be in a state to decide its fate for such reasons.
   
   We do terminate the Namenode in a bunch of conditions for sure; I don't want 
to get deep into those reasons, but it is more or less a preventive measure to 
terminate the Namenode if something serious has happened. By the architecture 
of HDFS itself, this doesn't look very valid for HDFS.
   
   PS. Making anything configurable doesn't justify having it in. If we are 
letting any user use this via a config as well, then we should be sure enough 
that it is a necessary and good thing to do; we cannot say "ohh, you configured 
it, now it is your problem"...
   
   I would say it is just pulling those cluster administrator things into the 
datanode, like what Cloudera Manager or maybe Ambari should do.
   
   Not in favour of this...




> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
> configured at Envoy, network connection issues or packet loss can be observed. 
> All of the Envoys basically form a transparent communication mesh in which 
> each app can send and receive packets to and from localhost and is unaware of 
> the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, sometimes 
> such a proxy-based setup can result in socket connection issues between the 
> datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to the 
> active namenode in the cluster, i.e. does not receive a heartbeat response in 
> time from the active namenode (even though the active namenode is not 
> terminated), it is not of much use. We should be able to provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable time duration, it should 
> terminate itself to avoid impacting the availability SLA. This is specifically 
> helpful when the underlying deployment or observability framework (e.g. K8S) 
> can start the datanode automatically upon its shutdown (unless it is being 
> restarted as part 

[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688827#comment-17688827
 ] 

ASF GitHub Bot commented on HDFS-16918:
---

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430775234

   For this change, the entire behavior is optional:
   
   ```
   <property>
     <name>dfs.datanode.health.activennconnect.timeout</name>
     <value>0</value>
     <description>
       If the value is greater than 0, each datanode would try to determine if it is healthy, i.e.
       all block pools are correctly initialized and able to heartbeat to the active namenode. At any
       given time, if the datanode loses the connection to the active namenode for the duration of
       milliseconds represented by the value of this config, it will attempt to shut itself down.
       If the value is 0, the datanode does not perform any such checks.
     </description>
   </property>
   ```
   
   Without providing a non-default value for this config, this behavior does 
not take effect.
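   
   For illustration only, a minimal sketch of the staleness check this config 
describes, assuming a hypothetical datanode-side helper (the class and method 
names are invented; this is not the actual HDFS-16918 patch):
   
   ```java
   public class ActiveNnConnectivityCheck {
     private final long timeoutMs;              // configured timeout; 0 disables the check
     private volatile long lastActiveHeartbeatMs;
   
     public ActiveNnConnectivityCheck(long timeoutMs) {
       this.timeoutMs = timeoutMs;
       this.lastActiveHeartbeatMs = System.currentTimeMillis();
     }
   
     /** Call whenever a heartbeat response is received from the ACTIVE namenode. */
     public void onActiveHeartbeat() {
       lastActiveHeartbeatMs = System.currentTimeMillis();
     }
   
     /** True if the datanode has been disconnected from the active NN for too long. */
     public boolean shouldShutdown() {
       if (timeoutMs <= 0) {
         return false;                          // opt-in: disabled by default
       }
       return System.currentTimeMillis() - lastActiveHeartbeatMs > timeoutMs;
     }
   }
   ```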




> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
> configured at Envoy, network connection issues or packet loss can be observed. 
> All of the Envoys basically form a transparent communication mesh in which 
> each app can send and receive packets to and from localhost and is unaware of 
> the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, sometimes 
> such a proxy-based setup can result in socket connection issues between the 
> datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to the 
> active namenode in the cluster, i.e. does not receive a heartbeat response in 
> time from the active namenode (even though the active namenode is not 
> terminated), it is not of much use. We should be able to provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable time duration, it should 
> terminate itself to avoid impacting the availability SLA. This is specifically 
> helpful when the underlying deployment or observability framework (e.g. K8S) 
> can start the datanode automatically upon its shutdown (unless it is being 
> restarted as part of a rolling upgrade) and help the newly brought up datanode 
> (in case of k8s, a new pod with dynamically changing nodes) establish new 
> socket connections to the active and standby namenodes. This should be opt-in 
> behavior and not the default.






[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688828#comment-17688828
 ] 

ASF GitHub Bot commented on HDFS-16896:
---

hadoop-yetus commented on PR #5322:
URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1430775324

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 46s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 25s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m  2s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   6m 12s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   5m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   2m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 51s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   2m 18s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 55s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 40s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 41s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   5m 52s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   5m 37s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   5m 37s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   1m  4s | 
[/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/results-checkstyle-hadoop-hdfs-project.txt)
 |  hadoop-hdfs-project: The patch generated 1 new + 31 unchanged - 0 fixed = 
32 total (was 31)  |
   | +1 :green_heart: |  mvnsite  |   2m 13s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   2m  3s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   5m 54s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 56s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |   2m 26s |  |  hadoop-hdfs-client in the patch 
passed.  |
   | -1 :x: |  unit  | 206m 23s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 51s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 360m  1s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs |
   |   | hadoop.hdfs.TestPread |
   |   | hadoop.hdfs.server.namenode.TestAuditLogger |
   |   | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport |
   |   | hadoop.hdfs.server.namenode.TestFsck |
   |   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5322 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux c1560d884f2a 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 

[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688826#comment-17688826
 ] 

ASF GitHub Bot commented on HDFS-16918:
---

virajjasani commented on PR #5396:
URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1430773119

   In a large fleet of datanodes, any datanode that does not stay connected to 
the active namenode due to a connectivity issue can choose to shut itself down 
rather than impact availability, and that is a wise thing for the datanode 
itself to do; of course not mandatorily, but as opt-in behavior.
   While an admin can do that, no human interaction can be fast enough to take 
action in a large-scale cluster in just a matter of a few seconds.




> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
> configured at Envoy, network connection issues or packet loss can be observed. 
> All of the Envoys basically form a transparent communication mesh in which 
> each app can send and receive packets to and from localhost and is unaware of 
> the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, sometimes 
> such a proxy-based setup can result in socket connection issues between the 
> datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to the 
> active namenode in the cluster, i.e. does not receive a heartbeat response in 
> time from the active namenode (even though the active namenode is not 
> terminated), it is not of much use. We should be able to provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable time duration, it should 
> terminate itself to avoid impacting the availability SLA. This is specifically 
> helpful when the underlying deployment or observability framework (e.g. K8S) 
> can start the datanode automatically upon its shutdown (unless it is being 
> restarted as part of a rolling upgrade) and help the newly brought up datanode 
> (in case of k8s, a new pod with dynamically changing nodes) establish new 
> socket connections to the active and standby namenodes. This should be opt-in 
> behavior and not the default.






[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16918:
--
Labels: pull-request-available  (was: )

> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
> configured at Envoy, network connection issues or packet loss can be observed. 
> All of the Envoys basically form a transparent communication mesh in which 
> each app can send and receive packets to and from localhost and is unaware of 
> the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, sometimes 
> such a proxy-based setup can result in socket connection issues between the 
> datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to the 
> active namenode in the cluster, i.e. does not receive a heartbeat response in 
> time from the active namenode (even though the active namenode is not 
> terminated), it is not of much use. We should be able to provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable time duration, it should 
> terminate itself to avoid impacting the availability SLA. This is specifically 
> helpful when the underlying deployment or observability framework (e.g. K8S) 
> can start the datanode automatically upon its shutdown (unless it is being 
> restarted as part of a rolling upgrade) and help the newly brought up datanode 
> (in case of k8s, a new pod with dynamically changing nodes) establish new 
> socket connections to the active and standby namenodes. This should be opt-in 
> behavior and not the default.






[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688810#comment-17688810
 ] 

ASF GitHub Bot commented on HDFS-16918:
---

virajjasani opened a new pull request, #5396:
URL: https://github.com/apache/hadoop/pull/5396

   While deploying HDFS on an Envoy proxy setup, depending on the socket 
timeout configured at Envoy, network connection issues or packet loss can be 
observed. All of the Envoys basically form a transparent communication mesh in 
which each app can send and receive packets to and from localhost and is 
unaware of the network topology.
   
   The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, sometimes 
such a proxy-based setup can result in socket connection issues between the 
datanode and the namenode.
   
   Many deployment frameworks provide auto-start functionality when any of the 
hadoop daemons are stopped. If a given datanode does not stay connected to the 
active namenode in the cluster, i.e. does not receive a heartbeat response in 
time from the active namenode (even though the active namenode is not 
terminated), it is not of much use. We should be able to provide configurable 
behavior such that if a given datanode cannot receive a heartbeat response from 
the active namenode within a configurable time duration, it should terminate 
itself to avoid impacting the availability SLA. This is specifically helpful 
when the underlying deployment or observability framework (e.g. K8S) can start 
the datanode automatically upon its shutdown (unless it is being restarted as 
part of a rolling upgrade) and help the newly brought up datanode (in case of 
k8s, a new pod with dynamically changing nodes) establish new socket 
connections to the active and standby namenodes. This should be an opt-in 
behavior and not the default one.




> Optionally shut down datanode if it does not stay connected to active namenode
> --
>
> Key: HDFS-16918
> URL: https://issues.apache.org/jira/browse/HDFS-16918
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
> configured at Envoy, network connection issues or packet loss can be observed. 
> All of the Envoys basically form a transparent communication mesh in which 
> each app can send and receive packets to and from localhost and is unaware of 
> the network topology.
> The primary purpose of Envoy is to make the network transparent to 
> applications, in order to identify network issues reliably. However, sometimes 
> such a proxy-based setup can result in socket connection issues between the 
> datanode and the namenode.
> Many deployment frameworks provide auto-start functionality when any of the 
> hadoop daemons are stopped. If a given datanode does not stay connected to the 
> active namenode in the cluster, i.e. does not receive a heartbeat response in 
> time from the active namenode (even though the active namenode is not 
> terminated), it is not of much use. We should be able to provide configurable 
> behavior such that if a given datanode cannot receive a heartbeat response 
> from the active namenode within a configurable time duration, it should 
> terminate itself to avoid impacting the availability SLA. This is specifically 
> helpful when the underlying deployment or observability framework (e.g. K8S) 
> can start the datanode automatically upon its shutdown (unless it is being 
> restarted as part of a rolling upgrade) and help the newly brought up datanode 
> (in case of k8s, a new pod with dynamically changing nodes) establish new 
> socket connections to the active and standby namenodes. This should be opt-in 
> behavior and not the default.






[jira] [Created] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode

2023-02-14 Thread Viraj Jasani (Jira)
Viraj Jasani created HDFS-16918:
---

 Summary: Optionally shut down datanode if it does not stay 
connected to active namenode
 Key: HDFS-16918
 URL: https://issues.apache.org/jira/browse/HDFS-16918
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Viraj Jasani
Assignee: Viraj Jasani


While deploying HDFS on an Envoy proxy setup, depending on the socket timeout 
configured at Envoy, network connection issues or packet loss can be observed. 
All of the Envoys basically form a transparent communication mesh in which each 
app can send and receive packets to and from localhost and is unaware of the 
network topology.

The primary purpose of Envoy is to make the network transparent to 
applications, in order to identify network issues reliably. However, sometimes 
such a proxy-based setup can result in socket connection issues between the 
datanode and the namenode.

Many deployment frameworks provide auto-start functionality when any of the 
hadoop daemons are stopped. If a given datanode does not stay connected to the 
active namenode in the cluster, i.e. does not receive a heartbeat response in 
time from the active namenode (even though the active namenode is not 
terminated), it is not of much use. We should be able to provide configurable 
behavior such that if a given datanode cannot receive a heartbeat response from 
the active namenode within a configurable time duration, it should terminate 
itself to avoid impacting the availability SLA. This is specifically helpful 
when the underlying deployment or observability framework (e.g. K8S) can start 
the datanode automatically upon its shutdown (unless it is being restarted as 
part of a rolling upgrade) and help the newly brought up datanode (in case of 
k8s, a new pod with dynamically changing nodes) establish new socket 
connections to the active and standby namenodes. This should be an opt-in 
behavior and not the default one.






[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688787#comment-17688787
 ] 

ASF GitHub Bot commented on HDFS-16761:
---

ayushtkn commented on PR #5390:
URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1430673981

   Was just gonna hit the merge button, so I thought of trying to reproduce 
this issue locally as well. But it didn't repro.
   https://user-images.githubusercontent.com/25608848/218913802-65e390ed-4f9e-445b-846a-f9dd1e8542d7.png
   The dead datanode row has one less column red, but it isn't going to the 
startup page as the original ticket mentioned. Anyone with any pointers?
   
   The change in this PR is still relevant, I think, but I am just curious 
whether the reported issue is correct or not, so we can change the title and 
then merge.




> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Krishna Reddy
>Assignee: Zita Dombi
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA
> - Open the Namenode UI and check the datanode page
> - Check that all datanodes are displayed
> - Now bring one datanode down
> - Wait for 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check
>  
> Actual Result: It is showing error message "NameNode is still loading. 
> Redirecting to the Startup Progress page."






[jira] [Updated] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down

2023-02-14 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-16761:

Fix Version/s: (was: 3.2.2)

> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Krishna Reddy
>Assignee: Zita Dombi
>Priority: Major
>  Labels: pull-request-available
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA
> - Open the Namenode UI and check the datanode page
> - Check that all datanodes are displayed
> - Now bring one datanode down
> - Wait for 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check
>  
> Actual Result: It is showing error message "NameNode is still loading. 
> Redirecting to the Startup Progress page."






[jira] [Created] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads

2023-02-14 Thread Ravindra Dingankar (Jira)
Ravindra Dingankar created HDFS-16917:
-

 Summary: Add transfer rate quantile metrics for DataNode reads
 Key: HDFS-16917
 URL: https://issues.apache.org/jira/browse/HDFS-16917
 Project: Hadoop HDFS
  Issue Type: Task
  Components: datanode
Reporter: Ravindra Dingankar


Currently we have the following metrics for datanode reads.
|BytesRead|Total number of bytes read from DataNode|
|BlocksRead|Total number of blocks read from DataNode|
|TotalReadTime|Total number of milliseconds spent on read operation|

We would like to add a new quantile metric calculating the distribution of data 
transfer rate for datanode reads.

 






[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688751#comment-17688751
 ] 

ASF GitHub Bot commented on HDFS-16914:
---

hfutatzhanghb commented on code in PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106495693


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block,
 }
 // Ensure we record the new generation stamp
 getEditLog().logSync();
+LOG.info("bumpBlockGenerationStamp({}, client={}) success",
+locatedBlock.getBlock(), clientName);

Review Comment:
   Hi @slfan1989, thanks for your review. The frequency of 
bumpBlockGenerationStamp logs is approximately equal to the frequency of 
updatePipeline, so I think we can use INFO level here.





> Add some logs for updateBlockForPipeline RPC.
> -
>
> Key: HDFS-16914
> URL: https://issues.apache.org/jira/browse/HDFS-16914
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Minor
>  Labels: pull-request-available
>
> Recently, we received a phone alarm about missing blocks. We found logs like 
> the ones below in one datanode where the block was placed:
>  
> {code:java}
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src: 
> /clientAddress:44638 dest: /localAddress:50010 of size 45733720
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src: 
> /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720 {code}
> The datanode received the same block with different generation stamps because 
> of a socket timeout exception. blk_1305044966_231826462 was received from the 
> upstream datanode in the pipeline, which has two datanodes. 
> blk_1305044966_231832415 was received from the client directly.
>  
> We have searched all log info about blk_1305044966 in the namenode and the 
> three datanodes in the original pipeline, but we could not obtain any helpful 
> message about the generation stamp 231826462. After diving into the source 
> code, it was assigned in NameNodeRpcServer#updateBlockForPipeline, which is 
> invoked in DataStreamer#setupPipelineInternal. The updateBlockForPipeline RPC 
> does not have any log info, so I think we should add some logs in this RPC.
>  
>  






[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688749#comment-17688749
 ] 

ASF GitHub Bot commented on HDFS-16914:
---

slfan1989 commented on code in PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#discussion_r1106493967


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##
@@ -5943,6 +5943,8 @@ LocatedBlock bumpBlockGenerationStamp(ExtendedBlock block,
 }
 // Ensure we record the new generation stamp
 getEditLog().logSync();
+LOG.info("bumpBlockGenerationStamp({}, client={}) success",
+locatedBlock.getBlock(), clientName);

Review Comment:
   @tomscut We record block information; will there be a lot of logs? Should it 
be changed to debug?





> Add some logs for updateBlockForPipeline RPC.
> -
>
> Key: HDFS-16914
> URL: https://issues.apache.org/jira/browse/HDFS-16914
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Assignee: ZhangHB
>Priority: Minor
>  Labels: pull-request-available
>
> Recently, we received a phone alarm about missing blocks. We found logs like 
> the ones below in one datanode where the block was placed:
>  
> {code:java}
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231832415 src: 
> /clientAddress:44638 dest: /localAddress:50010 of size 45733720
> 2023-02-09 15:05:10,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received BP-578784987-x.x.x.x-1667291826362:blk_1305044966_231826462 src: 
> /upStreamDatanode:60316 dest: /localAddress:50010 of size 45733720 {code}
> The datanode received the same block twice with different generation stamps 
> because of a socket timeout exception. blk_1305044966_231826462 was received 
> from the upstream datanode in a pipeline of two datanodes, while 
> blk_1305044966_231832415 was received directly from the client.
>  
> We searched all the logs about blk_1305044966 on the namenode and the three 
> datanodes of the original pipeline, but could not find anything helpful about 
> generation stamp 231826462. After diving into the source code, we found that it 
> is assigned in NameNodeRpcServer#updateBlockForPipeline, which is invoked from 
> DataStreamer#setupPipelineInternal. The updateBlockForPipeline RPC does not log 
> anything, so I think we should add some logs to this RPC.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688746#comment-17688746
 ] 

ASF GitHub Bot commented on HDFS-16896:
---

mccormickt12 commented on PR #5322:
URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1430524690

   > The change generally looks good to me.
   > 
   > One concern is that when `refetchLocation` is called inside 
`hedgedFetchBlockByteRange` we could remove a node from the ignoredList that is 
already part of the futures array. This would lead to multiple reads to the 
same node.
   > 
   > What I'm thinking of is
   > 
   > 1. Node A is added to futures array
   > 2. getFirstToComplete throws an InterruptedException when doing 
hedgedService.take()
   > 3. We call retchLocation, which remove node A from ignored list.
   > 4. The while loop re-adds Node A to the futures list.
   > 
   > I don't know if this actually can happen. Even if it is technically 
possible, it may not be an issue. Thoughts?
   
   Yes @simbadzina, I agree this was an issue, and I've resolved it now. As you 
pointed out, the future is removed in `getFirstToComplete`, so we now let it 
spin in the while loop; each iteration removes a future, and only once the 
futures list is empty and a refetch is needed do we clear the ignore list.
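
   To make the ordering concrete, here is a minimal, self-contained sketch of 
the invariant described above. It is an illustration only, not the actual 
DFSInputStream code; the class and method names are hypothetical.

{code:java}
import java.util.List;
import java.util.concurrent.Future;

// Illustration of the invariant discussed above: the ignored-node list may only
// be cleared once no hedged read is still in flight, i.e. after the futures
// list has drained. Otherwise a node with an outstanding read could be chosen
// again, producing duplicate reads against the same datanode.
public final class HedgedReadInvariantSketch {
  private HedgedReadInvariantSketch() {
  }

  static <T> void maybeClearIgnored(List<Future<T>> futures,
                                    List<String> ignoredNodes) {
    // In the fixed flow, each pass through the read loop removes one completed
    // future; only when none remain is it safe to forget the ignored nodes and
    // refetch block locations.
    if (futures.isEmpty()) {
      ignoredNodes.clear();
    }
  }
}
{code}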




> HDFS Client hedged read has increased failure rate than without hedged read
> ---
>
> Key: HDFS-16896
> URL: https://issues.apache.org/jira/browse/HDFS-16896
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Reporter: Tom McCormick
>Assignee: Tom McCormick
>Priority: Major
>  Labels: pull-request-available
>
> When hedged read is enabled in the HDFS client, we see an increased failure rate 
> on reads.
> *stacktrace*
>  
> {code:java}
> Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain 
> block: BP-1183972111-10.197.192.88-1590025572374:blk_17114848218_16043459722 
> file=/data/tracking/streaming/AdImpressionEvent/daily/2022/07/18/compaction_1/part-r-1914862.1658217125623.1362294472.orc
> at 
> org.apache.hadoop.hdfs.DFSInputStream.refetchLocations(DFSInputStream.java:1077)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1060)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:1039)
> at 
> org.apache.hadoop.hdfs.DFSInputStream.hedgedFetchBlockByteRange(DFSInputStream.java:1365)
> at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1572)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1535)
> at org.apache.hadoop.fs.FSInputStream.readFully(FSInputStream.java:121)
> at 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at 
> org.apache.hadoop.fs.RetryingInputStream.lambda$readFully$3(RetryingInputStream.java:172)
> at org.apache.hadoop.fs.RetryPolicy.lambda$run$0(RetryPolicy.java:137)
> at org.apache.hadoop.fs.NoOpRetryPolicy.run(NoOpRetryPolicy.java:36)
> at org.apache.hadoop.fs.RetryPolicy.run(RetryPolicy.java:136)
> at 
> org.apache.hadoop.fs.RetryingInputStream.readFully(RetryingInputStream.java:168)
> at 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at 
> org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:112)
> at 
> io.trino.plugin.hive.orc.HdfsOrcDataSource.readInternal(HdfsOrcDataSource.java:76)
> ... 46 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688529#comment-17688529
 ] 

ASF GitHub Bot commented on HDFS-16915:
---

hfutatzhanghb commented on PR #5392:
URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429855095

   > @Hexiaoqiao , hi, could you please take a look at this.
   
   To stay consistent with the existing metric names, I have kept the metric names 
as they are.




> Optimize metrics for operations hold lock times of FsDatasetImpl
> 
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current calculation also includes the time spent waiting for the lock, so I 
> think we should improve how the lock hold time metrics of FsDatasetImpl 
> operations are computed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688520#comment-17688520
 ] 

ASF GitHub Bot commented on HDFS-16915:
---

hadoop-yetus commented on PR #5392:
URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429817139

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 40s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  43m 55s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  7s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  8s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m 18s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 53s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 52 unchanged - 
0 fixed = 53 total (was 52)  |
   | +1 :green_heart: |  mvnsite  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  25m 46s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 213m 21s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 50s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 332m 23s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5392/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5392 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 9d84f26e215a 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 0620f38049cfeeac016452c84324c2e05625c234 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 

[jira] [Created] (HDFS-16916) Improve the use of JUnit Test in DFSClient

2023-02-14 Thread Hualong Zhang (Jira)
Hualong Zhang created HDFS-16916:


 Summary: Improve the use of JUnit Test in DFSClient
 Key: HDFS-16916
 URL: https://issues.apache.org/jira/browse/HDFS-16916
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsclient
Affects Versions: 3.4.0
Reporter: Hualong Zhang


Improve the use of JUnit Test in DFSClient



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688463#comment-17688463
 ] 

ASF GitHub Bot commented on HDFS-16761:
---

hadoop-yetus commented on PR #5390:
URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1429581205

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +0 :ok: |  xmllint  |   0m  0s |  |  xmllint was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
    _ trunk Compile Tests _ |
   | +0 :ok: |  mvndep  |  15m 31s |  |  Maven dependency ordering for branch  |
   | +1 :green_heart: |  mvninstall  |  31m 46s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  72m 32s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +0 :ok: |  mvndep  |   0m 29s |  |  Maven dependency ordering for patch  |
   | +1 :green_heart: |  mvninstall  |   2m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  shadedclient  |  25m 20s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 104m  2s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5390/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5390 |
   | Optional Tests | dupname asflicense shadedclient codespell detsecrets 
xmllint |
   | uname | Linux d3bf8d0f5cca 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 
18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 249046de348e4e82924e0a4f3c54b21156505cdc |
   | Max. process+thread count | 566 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs 
hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5390/2/console |
   | versions | git=2.25.1 maven=3.6.3 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Krishna Reddy
>Assignee: Zita Dombi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2
>
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA 
> - Open the Namenode UI and check the datanode page 
> - Check that all datanodes are displayed
> - Now take one datanode down
> - Wait 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check again
>  
> Actual result: It shows the error message "NameNode is still loading. 
> Redirecting to the Startup Progress page."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16673) Fix usage of chown

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688459#comment-17688459
 ] 

ASF GitHub Bot commented on HDFS-16673:
---

GuoPhilipse closed pull request #4602: HDFS-16673. Fix usage of chown
URL: https://github.com/apache/hadoop/pull/4602




> Fix usage of chown
> --
>
> Key: HDFS-16673
> URL: https://issues.apache.org/jira/browse/HDFS-16673
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 3.3.3
>Reporter: guophilipse
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Actually, the `chown` command can be used by the owner of the files or by the 
> superuser; we need to correct the doc.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688398#comment-17688398
 ] 

ASF GitHub Bot commented on HDFS-16761:
---

dombizita commented on PR #5390:
URL: https://github.com/apache/hadoop/pull/5390#issuecomment-1429432336

   thanks for the review @goiri and @ayushtkn, I updated my patch with the 
suggested change in `federationhealth.html`.




> Namenode UI for Datanodes page not loading if any data node is down
> ---
>
> Key: HDFS-16761
> URL: https://issues.apache.org/jira/browse/HDFS-16761
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.2
>Reporter: Krishna Reddy
>Assignee: Zita Dombi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.2.2
>
>
> Steps to reproduce:
> - Install the hadoop components and add 3 datanodes
> - Enable namenode HA 
> - Open the Namenode UI and check the datanode page 
> - Check that all datanodes are displayed
> - Now take one datanode down
> - Wait 10 minutes for the heartbeat to expire
> - Refresh the namenode page and check again
>  
> Actual result: It shows the error message "NameNode is still loading. 
> Redirecting to the Startup Progress page."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688367#comment-17688367
 ] 

ASF GitHub Bot commented on HDFS-16915:
---

hfutatzhanghb commented on PR #5392:
URL: https://github.com/apache/hadoop/pull/5392#issuecomment-1429338146

   @Hexiaoqiao , hi, could you please take a look at this.




> Optimize metrics for operations hold lock times of FsDatasetImpl
> 
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current calculation also includes the time spent waiting for the lock, so I 
> think we should improve how the lock hold time metrics of FsDatasetImpl 
> operations are computed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16915:
--
Labels: pull-request-available  (was: )

> Optimize metrics for operations hold lock times of FsDatasetImpl
> 
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Major
>  Labels: pull-request-available
>
> The current calculation also includes the time spent waiting for the lock, so I 
> think we should improve how the lock hold time metrics of FsDatasetImpl 
> operations are computed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688364#comment-17688364
 ] 

ASF GitHub Bot commented on HDFS-16915:
---

hfutatzhanghb opened a new pull request, #5392:
URL: https://github.com/apache/hadoop/pull/5392

   JIRA: https://issues.apache.org/jira/browse/HDFS-16915
   
   The current calculation also includes the time spent waiting for the lock, so I 
think we should improve how the lock hold time metrics of FsDatasetImpl 
operations are computed.
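
   A minimal, self-contained sketch of the idea, using a plain ReentrantLock and 
System.nanoTime() instead of the actual FsDatasetImpl lock and metrics classes: 
the timer starts only after the lock has been acquired, so lock-wait time is 
excluded from the recorded hold time.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class LockHoldTimeSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private long totalHoldNanos;

  public void runGuarded(Runnable work) {
    lock.lock();                                  // wait time is NOT measured
    final long holdStart = System.nanoTime();     // start timing after acquisition
    try {
      work.run();
    } finally {
      totalHoldNanos += System.nanoTime() - holdStart;  // record only hold time
      lock.unlock();
    }
  }

  public long getTotalHoldNanos() {
    return totalHoldNanos;
  }
}
{code}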




> Optimize metrics for operations hold lock times of FsDatasetImpl
> 
>
> Key: HDFS-16915
> URL: https://issues.apache.org/jira/browse/HDFS-16915
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.3.4
>Reporter: ZhangHB
>Priority: Major
>
> The current calculation also includes the time spent waiting for the lock, so I 
> think we should improve how the lock hold time metrics of FsDatasetImpl 
> operations are computed.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16915) Optimize metrics for operations hold lock times of FsDatasetImpl

2023-02-14 Thread ZhangHB (Jira)
ZhangHB created HDFS-16915:
--

 Summary: Optimize metrics for operations hold lock times of 
FsDatasetImpl
 Key: HDFS-16915
 URL: https://issues.apache.org/jira/browse/HDFS-16915
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 3.3.4
Reporter: ZhangHB


The current calculation also includes the time spent waiting for the lock, so I 
think we should improve how the lock hold time metrics of FsDatasetImpl 
operations are computed.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16914) Add some logs for updateBlockForPipeline RPC.

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688350#comment-17688350
 ] 

ASF GitHub Bot commented on HDFS-16914:
---

hadoop-yetus commented on PR #5381:
URL: https://github.com/apache/hadoop/pull/5381#issuecomment-1429286505

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 53s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  47m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 28s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  checkstyle  |   1m  7s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 31s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  7s |  |  trunk passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  trunk passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 34s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  28m 57s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 28s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javac  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  javac  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 54s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  |  the patch passed with JDK 
Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_352-8u352-ga-1~20.04-b08  |
   | +1 :green_heart: |  spotbugs  |   3m 31s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  28m 39s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 227m 34s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 354m 51s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.datanode.TestDirectoryScanner |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.42 ServerAPI=1.42 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5381 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 4c42e84abc99 4.15.0-197-generic #208-Ubuntu SMP Tue Nov 1 
17:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 26a24245b3bd80b8f4fde78ad252e0a96f2fb865 |
   | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5381/4/testReport/ |
   | Max. process+thread count | 2122 (vs. ulimit of 5500) |
   | modules | C: