Jian Zhang created HDFS-17166:
---------------------------------

             Summary: RBF: Throwing NoNamenodesAvailableException for a long 
time, when failover
                 Key: HDFS-17166
                 URL: https://issues.apache.org/jira/browse/HDFS-17166
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Jian Zhang
         Attachments: image-2023-08-26-00-24-02-016.png, 
image-2023-08-26-00-25-42-086.png

When a nameservice (ns) fails over, the router may record that the ns has no 
active namenode (nn). The router then cannot find the active nn in the ns for 
about 1 minute. The client reports an error after exhausting its retries, and 
the router cannot serve the ns during that time.

11:52:44 error reporting starts
!image-2023-08-26-00-24-16-538.png|width=800,height=100!
11:53:46 error reporting ends

!image-2023-08-26-00-25-42-086.png|width=800,height=50!

 

At this point, the failover has already completed successfully in the ns, and 
a client that connects directly to the active namenode can access it, but a 
client going through the router cannot access the ns for up to a minute.

 

*There is a bug in this logic:*

* A certain ns starts to fail over.

* For a short period the ns has no active nn, and the router reports this 
status (no active nn) to the state store.

* After a period of time, the router pulls the state store data to update its 
cache, and the cache records that the ns has no active nn.

* The failover completes successfully, at which point the ns actually has an 
active nn.

* Assume it is not yet time for the router to update its cache again.

* A client sends a request for the ns to the router, and the router accesses 
the first nn of the ns in its cache (which records no active nn).

* Unfortunately, that nn really is a standby, so the request fails and enters 
the exception handling logic. The router finds that the cache has no active nn 
for the ns and throws NoNamenodesAvailableException.

* The NoNamenodesAvailableException is wrapped as a RetriableException, which 
causes the client to retry. Since every router picks the real standby nn from 
its cache (it is always the first entry in the cache and has a high priority), 
a NoNamenodesAvailableException is thrown every time until the router updates 
its cache from the state store.
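
The sequence above can be illustrated with a small toy model. This is only a 
sketch under stated assumptions, not the router's actual code: the class 
StaleCacheDemo, its CachedNn entries and the methods invoke and countFailures 
are hypothetical names, and a null return stands in for the point where the 
router would throw NoNamenodesAvailableException.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the stale-cache retry loop. Every name here is hypothetical. */
class StaleCacheDemo {
    enum State { ACTIVE, STANDBY }

    /** One namenode as seen by the router: cached state vs. real state. */
    static final class CachedNn {
        final String id;
        final State cachedState;
        final State realState;

        CachedNn(String id, State cachedState, State realState) {
            this.id = id;
            this.cachedState = cachedState;
            this.realState = realState;
        }
    }

    /** Cache right after the failover: nn2 is really active, the cache says nobody is. */
    static List<CachedNn> staleCacheAfterFailover() {
        List<CachedNn> cache = new ArrayList<>();
        cache.add(new CachedNn("nn1", State.STANDBY, State.STANDBY));
        cache.add(new CachedNn("nn2", State.STANDBY, State.ACTIVE));
        return cache;
    }

    /**
     * Models one client request. Returns the id of the nn that served it, or
     * null where the router would throw NoNamenodesAvailableException.
     */
    static String invoke(List<CachedNn> cache) {
        CachedNn first = cache.get(0);
        if (first.realState == State.ACTIVE) {
            return first.id;
        }
        // The first nn rejected the call as a standby. Fall back only to an
        // nn that the cache believes is active.
        for (CachedNn nn : cache) {
            if (nn.cachedState == State.ACTIVE && nn.realState == State.ACTIVE) {
                return nn.id;
            }
        }
        return null; // the cache records no active nn
    }

    /** Counts how many of the given retries fail while the cache stays stale. */
    static int countFailures(List<CachedNn> cache, int retries) {
        int failures = 0;
        for (int i = 0; i < retries; i++) {
            if (invoke(cache) == null) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println(countFailures(staleCacheAfterFailover(), 5));
    }
}
```

Because nothing in this model changes the cached order between retries, every 
retry hits nn1 again and fails, which mirrors the behaviour described above.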

 

*Fix the bug*

Suppose the router's cache records no active nn for an ns while the ns actually 
has one. If a client request then fails with a NoNamenodesAvailableException, 
this proves that the requested nn is a real standby nn. The priority of that nn 
should be lowered so that the next request finds the real active nn. Otherwise 
the router keeps requesting the real standby nn and cannot serve the ns to 
clients until the next cache update.
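
One way to picture the proposed fix is a minimal sketch in which the nn that 
rejected the request is moved to the tail of the cached order. This is an 
illustration under assumptions, not the actual patch: the class 
DeprioritizeStandbyDemo and the methods invokeWithDemotion and 
attemptsUntilSuccess are hypothetical, and the cache is reduced to an ordered 
list of nn ids.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of lowering the priority of a real standby nn. Names are hypothetical. */
class DeprioritizeStandbyDemo {

    /**
     * Tries the first nn in the cached order. If it is not the real active nn,
     * the nn is moved to the tail so the next retry starts with another nn.
     * Returns the id of the serving nn, or null if this attempt failed.
     */
    static String invokeWithDemotion(List<String> cachedOrder, String realActive) {
        String first = cachedOrder.get(0);
        if (first.equals(realActive)) {
            return first;
        }
        // The request proved that 'first' is a real standby: lower its priority.
        cachedOrder.remove(0);
        cachedOrder.add(first);
        return null;
    }

    /** Returns the number of attempts needed to succeed, or -1 if all retries fail. */
    static int attemptsUntilSuccess(List<String> cachedOrder, String realActive,
                                    int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (invokeWithDemotion(cachedOrder, realActive) != null) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> order = new ArrayList<>(Arrays.asList("nn1", "nn2"));
        System.out.println(attemptsUntilSuccess(order, "nn2", 3));
        System.out.println(order);
    }
}
```

In this model the first attempt still fails, but the retry reaches nn2 without 
waiting for the next cache update from the state store.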



