[ 
https://issues.apache.org/jira/browse/HDFS-17768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17942180#comment-17942180
 ] 

ASF GitHub Bot commented on HDFS-17768:
---------------------------------------

dParikesit opened a new pull request, #7593:
URL: https://github.com/apache/hadoop/pull/7593

   ### Description of PR
   Jira: [HDFS-17768](https://issues.apache.org/jira/browse/HDFS-17768)
   
   In our testing with the latest HDFS version (e8a64d0), we found a case 
similar to [HDFS-16732](https://issues.apache.org/jira/browse/HDFS-16732) 
in getBatchedListing. If the block report of the observer NameNode is delayed 
during a getBatchedListing call, one or more of the listing results will 
contain blocks without locations.
   
   Steps to reproduce this bug:
   
   1. Start a cluster with one observer NameNode.
   2. Create an empty file.
   3. Inject a network delay between the observer NameNode and the active 
NameNode to delay the block report (or add a sleep to the 
BlockReportProcessingThread of the observer).
   4. Append to the file to add a block.
   5. Send a batchedListPaths request using the client API.
   6. Check that the result contains a block without a location.
   
   
   In [HDFS-16732](https://issues.apache.org/jira/browse/HDFS-16732) and 
[HDFS-13924](https://issues.apache.org/jira/browse/HDFS-13924), a check was 
added to getBlockLocations, getFileInfo, and getListing that verifies whether 
the returned blocks have valid locations. Missing locations indicate that the 
observer NameNode is not up to date with the active NameNode.
   
   We propose adding the same check to getBatchedListing. If any of the 
sub-listings returns blocks without locations, getBatchedListing throws 
ObserverRetryOnActiveException and exits early. The entire batched listing 
request is then retried on the active NameNode.
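
The proposed fail-fast behavior can be illustrated with a minimal, self-contained sketch. It is a model of the idea only, not the code in this PR: `RetryOnActiveException`, `Block`, `FileEntry`, `checkLocations`, and `batchedListing` are hypothetical stand-ins for the real Hadoop types and methods, and the real ObserverRetryOnActiveException is assumed to be handled by the client's retry logic rather than caught locally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Self-contained model. Every type here is a hypothetical stand-in,
// not the real Hadoop class with a similar name.
class BatchedListingCheckSketch {

  /** Stand-in for the exception that asks the client to retry on the active NameNode. */
  static class RetryOnActiveException extends RuntimeException {
    RetryOnActiveException(String message) {
      super(message);
    }
  }

  /** A block plus the datanodes known to hold it (empty if no block report arrived yet). */
  static class Block {
    final String id;
    final List<String> locations;

    Block(String id, String... locations) {
      this.id = id;
      this.locations = Arrays.asList(locations);
    }
  }

  /** One file in a listing, together with its blocks. */
  static class FileEntry {
    final String path;
    final List<Block> blocks;

    FileEntry(String path, Block... blocks) {
      this.path = path;
      this.blocks = Arrays.asList(blocks);
    }
  }

  /** Throws if any block of any entry in the sub-listing has no known location. */
  static void checkLocations(List<FileEntry> subListing) {
    for (FileEntry file : subListing) {
      for (Block block : file.blocks) {
        if (block.locations.isEmpty()) {
          throw new RetryOnActiveException("Zero block locations for " + file.path);
        }
      }
    }
  }

  /**
   * Assembles a batched listing from per-path sub-listings. On an observer,
   * each sub-listing is checked before it is added, so the whole request
   * is abandoned at the first block without a location.
   */
  static List<List<FileEntry>> batchedListing(
      List<List<FileEntry>> subListings, boolean isObserver) {
    List<List<FileEntry>> result = new ArrayList<>();
    for (List<FileEntry> subListing : subListings) {
      if (isObserver) {
        checkLocations(subListing);
      }
      result.add(subListing);
    }
    return result;
  }

  public static void main(String[] args) {
    List<FileEntry> upToDate =
        Arrays.asList(new FileEntry("/dir1/file", new Block("blk_1", "dn1")));
    // The appended block has no locations because the block report is delayed.
    List<FileEntry> stale =
        Arrays.asList(new FileEntry("/dir2/appended", new Block("blk_2")));
    try {
      batchedListing(Arrays.asList(upToDate, stale), true);
    } catch (RetryOnActiveException e) {
      System.out.println("Retry on active: " + e.getMessage());
    }
  }
}
```

In this sketch the check runs per sub-listing rather than once over the assembled result, so no partial batch is returned to the caller when a later sub-listing turns out to be stale.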
   
   Your insights are very much appreciated. We will continue to follow up on 
this issue until it is resolved.
   
   




> Observer namenode network delay causing empty block location for 
> getBatchedListing
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-17768
>                 URL: https://issues.apache.org/jira/browse/HDFS-17768
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.1
>            Reporter: Dimas Shidqi Parikesit
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
