[jira] [Commented] (HDFS-17166) RBF: Throwing NoNamenodesAvailableException for a long time, when failover

ASF GitHub Bot (Jira) Fri, 25 Aug 2023 18:49:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759192#comment-17759192
 ]


ASF GitHub Bot commented on HDFS-17166:
---------------------------------------

KeeProMise commented on code in PR #5990:
URL: https://github.com/apache/hadoop/pull/5990#discussion_r1306285491


##########
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/resolver/MembershipNamenodeResolver.java:
##########
@@ -478,4 +476,27 @@ private List<MembershipState> getRecentRegistrationForQuery(
   public void setRouterId(String router) {
     this.routerId = router;
   }
+
+  /**
+   * Shuffle cache, to ensure that the current nn will not be accessed first next time.
+   *
+   *
+   * @param nsId name service id
+   * @param namenode namenode contexts
+   */
+  @Override
+  public synchronized void shuffleCache(String nsId, FederationNamenodeContext namenode) {
+    cacheNS.compute(Pair.of(nsId, false), (ns, namenodeContexts) -> {
+      if (namenodeContexts != null
+              && namenodeContexts.size() > 0

Review Comment:
   The check is not done outside the `compute` call because reading the cache 
and modifying it must be atomic. Consider the following interleaving:
   
   1. **Thread1** sees that the cache is not empty and reads the entry for the 
ns, which has no active nn.
   2. **Thread2** (the loadCache thread) clears the cache.
   3. **Thread3** processes a client request, finds the cache empty, and 
updates the cache. At this point the ns entry in the cache has an active nn.
   4. **Thread1** rotates the data it read earlier (with no active nn) and 
writes it back into the cache, so the ns in the cache again has no active nn.
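
The interleaving above can be avoided by doing the check and the rotation inside a single `compute` call. The following is only a minimal sketch, not the code from the PR: it assumes the cache is a `ConcurrentHashMap`, uses plain strings in place of `FederationNamenodeContext`, and the names `AtomicRotateSketch`, `CACHE` and `rotate` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class AtomicRotateSketch {
  // Hypothetical stand-in for cacheNS: nsId -> ordered list of namenode ids.
  static final ConcurrentHashMap<String, List<String>> CACHE =
      new ConcurrentHashMap<>();

  /**
   * Moves failedNn to the end of the cached list for nsId, but only if it is
   * currently the first entry. The emptiness check and the write both happen
   * inside compute(), so no other thread can clear or refill the entry in
   * between.
   */
  static void rotate(String nsId, String failedNn) {
    CACHE.compute(nsId, (ns, nns) -> {
      if (nns == null || nns.isEmpty() || !nns.get(0).equals(failedNn)) {
        // Nothing to rotate. Returning null for an absent key keeps it absent.
        return nns;
      }
      // Build a new list instead of mutating the one other readers may hold.
      List<String> rotated = new ArrayList<>(nns.subList(1, nns.size()));
      rotated.add(failedNn);
      return rotated;
    });
  }
}
```

If the check were done before calling `compute`, a clear-and-reload by another thread could slip in between the check and the write, which is exactly steps 2 and 3 above.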
   





> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --------------------------------------------------------------------------
>
>                 Key: HDFS-17166
>                 URL: https://issues.apache.org/jira/browse/HDFS-17166
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jian Zhang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 
> fix_NoNamenodesAvailableException_long_time_when_ns_failover.patch, 
> image-2023-08-26-00-24-02-016.png, image-2023-08-26-00-25-42-086.png
>
>
> When an ns fails over, the router may record that the ns has no active 
> namenode, and then it cannot find the active nn in the ns for about 1 minute. 
> The client reports an error after exhausting its retries, and the router is 
> unable to serve the ns for a long time.
>  11:52:44 Start reporting
> !image-2023-08-26-00-24-02-016.png|width=800,height=100!
> 11:53:46 end reporting
> !image-2023-08-26-00-25-42-086.png|width=800,height=50!
>  
> At this point the failover has completed successfully in the ns, and a 
> client that connects directly to the active namenode can access it, but a 
> client cannot access the ns through the router for up to a minute.
>  
> *There is a bug in this logic:*
> * An ns starts to fail over.
> * For a while there is no active nn in the ns, and the router reports this 
> status (no active nn) to the state store.
> * After a period of time, the router pulls the state store data to update its 
> cache, and the cache records that the ns has no active nn.
> * The failover completes successfully, at which point the ns actually has an 
> active nn.
> * Assume it is not yet time for the router to update the cache.
> * A client sends a request for the ns to the router, and the router accesses 
> the first nn of the ns in its cache (which records no active nn).
> * Unfortunately, that nn really is standby, so the request fails and enters 
> the exception handling logic. The router finds that there is no active nn 
> for the ns in the cache and throws NoNamenodesAvailableException.
> * The NoNamenodesAvailableException is wrapped in a RetriableException, 
> which causes the client to retry. Because the router picks the same real 
> standby nn from the cache every time (it is always the first one in the 
> cache and has a high priority), a NoNamenodesAvailableException is thrown on 
> every retry until the router updates the cache from the state store.
>  
> *Fix the bug*
> When an ns in the router's cache has no active nn although the ns really 
> does have one, and a client request results in a 
> NoNamenodesAvailableException, this proves that the requested nn is a real 
> standby nn. The priority of this nn should be lowered so that the next 
> request finds the real active nn. Otherwise the router keeps requesting the 
> real standby nn and cannot serve the ns to clients until the next cache 
> update.
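
The effect of lowering the priority of the standby nn can be illustrated with a small simulation. This is a simplified model under stated assumptions, not router code: the cached order is a non-empty list of strings, the router always tries the first entry, and the names `FailoverRetrySketch` and `attemptsUntilSuccess` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FailoverRetrySketch {
  /**
   * Returns the attempt number on which the real active nn is reached,
   * or -1 if it is not reached within maxRetries.
   * Assumes cachedOrder is non-empty.
   */
  static int attemptsUntilSuccess(List<String> cachedOrder, String realActive,
      int maxRetries, boolean rotateOnFailure) {
    List<String> order = new ArrayList<>(cachedOrder);
    for (int attempt = 1; attempt <= maxRetries; attempt++) {
      String target = order.get(0); // the model always tries the first cached nn
      if (target.equals(realActive)) {
        return attempt;
      }
      // The target is a real standby nn; in the scenario described above the
      // router would throw NoNamenodesAvailableException at this point.
      if (rotateOnFailure) {
        order.add(order.remove(0)); // lower the priority of the standby nn
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    List<String> cache = Arrays.asList("nn-standby", "nn-active");
    // Without rotation every retry hits the standby nn.
    System.out.println(attemptsUntilSuccess(cache, "nn-active", 3, false));
    // With rotation the second attempt reaches the active nn.
    System.out.println(attemptsUntilSuccess(cache, "nn-active", 3, true));
  }
}
```

In this model the stale cache order only costs one failed attempt when rotation is enabled, instead of failing until the next cache refresh.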



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

