[
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian Zhang updated HDFS-17166:
------------------------------
Attachment: HDFS-17166.004.patch
> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --------------------------------------------------------------------------
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jian Zhang
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch,
> HDFS-17166.003.patch, HDFS-17166.004.patch,
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png,
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png,
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png,
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png,
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png,
> image-2023-08-26-22-48-12-352.png
>
>
> When an ns fails over, the router may record that the ns has no active
> namenode, and then it cannot find the active nn in the ns for about 1 minute.
> The client reports an error after exhausting its retries, and the router
> is unable to serve the ns for that whole period.
> 11:52:44 Start reporting
> !image-2023-08-26-12-06-01-275.png|width=800,height=100!
> 11:53:46 End reporting
> !image-2023-08-26-12-07-47-010.png|width=800,height=20!
>
> At this point the failover has completed successfully in the ns, and a
> client that connects directly to the active namenode can access it
> successfully, but clients cannot access the ns through the router for up to
> a minute.
>
> *There is a bug in this logic:*
> * A certain ns starts to fail over.
> * For a moment there is no active nn in the ns, and the router reports
> this status (no active nn) to the state store.
> * After a period of time, the router pulls the state store data to update
> its cache, and the cache records that the ns has no active nn.
> * The failover completes successfully, at which point the ns actually has
> an active nn.
> * Assume it is not yet time for the router to update the cache again.
> * A client sends a request for the ns to the router, and the router
> accesses the first nn of the ns in its cache (which records no active nn).
> * Unfortunately, that nn really is standby, so the request fails and
> enters the exception handling logic. The router finds that the cache has
> no active nn for the ns and throws NoNamenodesAvailableException.
> * The NoNamenodesAvailableException is wrapped as a RetriableException,
> which causes the client to retry. Since every router picks the real
> standby nn from its cache (it is always the first one in the cache and
> has a high priority), a NoNamenodesAvailableException is thrown on every
> retry until the router updates the cache from the state store.
>
> *How to reproduce*
> # Suppose we have an ns, ns60, which contains 2 nn: nn6001 is active and
> nn6002 is standby
> # Assume that when nn6001 and nn6002 are both in standby state, the
> priority of nn6002 is higher than that of nn6001
> # Use the default configuration
> # Shut down the zkfc of both nn, {*}hadoop-daemon.sh stop zkfc{*}, so that
> failover can be performed manually
> # Manually switch nn6001 active->standby, *hdfs haadmin -ns ns60
> -transitionToStandby --forcemanual nn6001*
> # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby
> !image-2023-08-26-11-48-22-131.png|width=800,height=20!
> # Manually switch nn6001 standby->active, *hdfs haadmin -ns ns60
> -transitionToActive --forcemanual nn6001*
> # The client accesses ns60 through router
> !image-2023-08-26-11-56-50-181.png|width=800,height=50!
> # After about one minute, request ns60 again through the router
> !image-2023-08-26-11-59-25-153.png|width=800,height=50!
> # Exceptions are reported for both requests; check the router log
> !image-2023-08-26-12-01-39-968.png|width=800,height=20!
> # The router cannot respond to the client's requests for ns60 for about a minute
>
>
> *Fix the bug*
> When the router's cache records no active nn for an ns that in reality has
> one, and a client request throws NoNamenodesAvailableException, this
> proves that the requested nn is a real standby nn. The priority of this nn
> should be lowered so that the next request finds the real active nn.
> Otherwise the router keeps requesting the real standby nn and cannot
> serve the ns to clients until the next cache update.
>
> *Test my patch*
> *1. Unit testing*
> *2. Comparison test*
> * Suppose we have 2 clients [c1 c2], 2 routers [r1 r2] and an ns [ns60]; the
> ns has 2 nn [nn6001 nn6002]
> * When both nn6001 and nn6002 are in standby state, the priority of nn6002
> is higher than that of nn6001
> * r1 uses the package that fixes the bug, r2 uses the original package
> which has the bug
> * c1 sends requests to r1 in a loop, and c2 sends requests to r2 in a loop;
> the requests are for ns60
> * Put both nn6001 and nn6002 in standby state
> * After the router reports that both nn are in standby state, switch nn6001
> to active
> *14:15:24* nn6001 is active
> !image-2023-08-26-22-45-46-814.png|width=800,height=120!
> * Check the log of router r1: after nn6001 switches to active,
> NoNamenodesAvailableException is printed only once
> !image-2023-08-26-22-47-22-276.png|width=800,height=30!
>
> * Check the log of router r2: NoNamenodesAvailableException is printed for
> more than one minute after nn6001 switches to active
> !image-2023-08-26-22-47-41-988.png|width=800,height=150!
>
> * At 14:16:25, client c2, which accesses the router with the bug, could not
> get the data, while client c1, which accesses the fixed router, could get
> the data normally:
> c2's log: unable to access normally
> !image-2023-08-26-22-48-02-086.png|width=800,height=50!
> c1's log: displays the result correctly
> !image-2023-08-26-22-48-12-352.png|width=800,height=150!
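
The failure sequence and the proposed fix quoted above can be illustrated with a small toy model. This is only a sketch and is not taken from the patch or from the Hadoop code base: the names RouterModel, count_failures, demote_on_failure and NoNamenodesAvailableError are hypothetical, and the model assumes the router's cache records no active nn for the whole run (the window before the next state store refresh). "Lowering the priority" is modelled here simply as moving the failed nn to the end of the cached order.

```python
class NoNamenodesAvailableError(Exception):
    """Stand-in for Hadoop's NoNamenodesAvailableException (hypothetical)."""


class RouterModel:
    """Toy router whose cache records no active nn for the ns."""

    def __init__(self, cached_order, demote_on_failure):
        # Namenodes in cache order, highest priority first.
        self.cached_order = list(cached_order)
        # True models the patched router (r1), False the original one (r2).
        self.demote_on_failure = demote_on_failure

    def invoke(self, real_active):
        """Send one request to the first cached nn."""
        target = self.cached_order[0]
        if target == real_active:
            return target
        # The target is really standby and the cache knows no active nn.
        if self.demote_on_failure:
            # Move the failed nn to the end so the next request tries another nn.
            self.cached_order.append(self.cached_order.pop(0))
        raise NoNamenodesAvailableError(target)


def count_failures(router, real_active, attempts):
    """Count NoNamenodesAvailableError results over `attempts` client requests."""
    failures = 0
    for _ in range(attempts):
        try:
            router.invoke(real_active)
        except NoNamenodesAvailableError:
            failures += 1
    return failures


# nn6002 has the higher priority in the cache, but nn6001 is the real active nn.
original = RouterModel(["nn6002", "nn6001"], demote_on_failure=False)
patched = RouterModel(["nn6002", "nn6001"], demote_on_failure=True)
print(count_failures(original, "nn6001", 5))  # 5: every request fails
print(count_failures(patched, "nn6001", 5))   # 1: only the first request fails
```

In this model the original router fails every request until its cache is refreshed, while the patched router fails once and then reaches nn6001, which matches the r2 and r1 logs in the comparison test.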
--
This message was sent by Atlassian Jira
(v8.20.10#820010)