[
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jian Zhang updated HDFS-17166:
------------------------------
Attachment: image-2023-08-26-11-48-22-131.png
> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --------------------------------------------------------------------------
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jian Zhang
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch,
> image-2023-08-26-00-24-02-016.png, image-2023-08-26-00-25-42-086.png,
> image-2023-08-26-11-48-07-378.png, image-2023-08-26-11-48-22-131.png
>
>
> When an ns fails over, the router may record that the ns has no active
> namenode, and it then cannot find the active nn in that ns for about 1
> minute. The client reports an error after exhausting its retries, and the
> router is unable to serve the ns for a long time.
> 11:52:44 Start reporting
> !image-2023-08-26-00-24-02-016.png|width=800,height=100!
> 11:53:46 end reporting
> !image-2023-08-26-00-25-42-086.png|width=800,height=50!
>
> At this point, the failover has already completed successfully in the ns:
> the client can connect directly to the active namenode and succeed, yet it
> cannot access the ns through the router for up to a minute.
>
> *There is a bug in this logic:*
> * A certain ns starts to fail over.
> * While the ns has no active nn, the router reports that status (no active
> nn) to the state store.
> * After a period of time, the router pulls the state store data to update
> its cache, and the cache now records that the ns has no active nn.
> * The failover completes successfully, so the ns actually has an active nn
> again.
> * The router's next cache refresh has not happened yet.
> * A client sends a request for the ns; the router accesses the first nn of
> the ns in its cache (which records no active nn).
> * Unfortunately, that nn really is standby, so the request fails and enters
> the exception-handling logic. The router finds no active nn for the ns in
> its cache and throws NoNamenodesAvailableException.
> * The NoNamenodesAvailableException is wrapped as a RetriableException,
> which causes the client to retry. Since each retry hits the same true
> standby nn in the cache (it is always first in the cache and has the
> highest priority), a NoNamenodesAvailableException is thrown every time
> until the router updates its cache from the state store.
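The loop above can be sketched in a few lines of Java (hypothetical class and method names, not the actual RouterRpcClient code): the cached order never changes between refreshes, so every client retry hits the same true-standby nn and throws.

```java
import java.util.List;

// Hypothetical sketch of the buggy retry behavior; not real router code.
class NoNamenodesAvailableException extends Exception {}

class RouterCacheBugSketch {
    enum State { ACTIVE, STANDBY }

    // Cached states, refreshed only periodically: the router still
    // believes both nn of the ns are standby.
    static final List<State> cachedStates =
        List.of(State.STANDBY, State.STANDBY);

    // Real cluster state after failover: the second nn is actually active.
    static final List<State> realStates =
        List.of(State.STANDBY, State.ACTIVE);

    // One router attempt: it always tries the first (highest-priority)
    // nn in its cache.
    static String invoke() throws NoNamenodesAvailableException {
        int target = 0; // cache order never changes between refreshes
        if (realStates.get(target) == State.STANDBY) {
            // The request failed; the cache shows no ACTIVE nn, so the
            // router throws instead of probing the other nn.
            if (!cachedStates.contains(State.ACTIVE)) {
                throw new NoNamenodesAvailableException();
            }
        }
        return "ok";
    }

    public static void main(String[] args) {
        int failures = 0;
        for (int retry = 0; retry < 3; retry++) { // client-side retries
            try {
                invoke();
            } catch (NoNamenodesAvailableException e) {
                failures++;
            }
        }
        // Every retry fails until the router refreshes its cache.
        System.out.println(failures + "/3 retries failed");
    }
}
```

The sketch deliberately separates `cachedStates` (what the router believes) from `realStates` (the cluster's truth) to show that the exception is driven entirely by the stale cache.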
>
> *How to reproduce*
> # Suppose we have an ns ns60 that contains 2 nn: nn6001 is active and
> nn6002 is standby
> # Use the default configuration
> # Shut down zkfc on both nn ({*}hadoop-daemon.sh stop zkfc{*}) so that
> failover must be performed manually
> # Manually switch nn6001 active->standby: hdfs haadmin -ns ns60
> -transitionToStandby --forcemanual nn6001
> # Make sure that the NamenodeHeartbeatService reports that nn6001 is
> standby
>
> *Fix the bug*
> When the router's cache says an ns has no active nn but the ns actually
> does have one, and a client request throws a NoNamenodesAvailableException,
> the nn that was just contacted is proven to be a true standby. Lower that
> nn's priority so that the next request can find the real active nn. This
> avoids repeatedly hitting the true standby nn, which would otherwise leave
> the router unable to serve the ns until the next cache update.
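The fix can be sketched as follows (hypothetical names, a simplification of the actual patch): once a request proves that the first cached nn is really standby, the router deprioritizes it, so the next retry reaches the real active nn without waiting for a cache refresh.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the proposed fix; not the actual patch code.
class RouterPrioritySketch {
    enum State { ACTIVE, STANDBY }

    // Cached priority order of the ns's namenodes (indices 0 and 1);
    // index 0 is tried first.
    static final Deque<Integer> priority = new ArrayDeque<>(List.of(0, 1));

    // Real cluster state after failover: nn 1 is now active.
    static final State[] realStates = { State.STANDBY, State.ACTIVE };

    // One router attempt; returns true if it reached the active nn.
    static boolean invoke() {
        int target = priority.peekFirst();
        if (realStates[target] == State.STANDBY) {
            // Proven standby: lower its priority instead of giving up,
            // so the next attempt tries the other nn.
            priority.addLast(priority.pollFirst());
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        boolean first = invoke();  // hits the true standby, deprioritizes it
        boolean second = invoke(); // now reaches the real active nn
        System.out.println(first + " then " + second);
    }
}
```

Compared with the buggy behavior, the only change is the `addLast(pollFirst())` rotation: the cache still says "no active nn", but the router stops retrying the nn that has already proven itself standby.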
--
This message was sent by Atlassian Jira
(v8.20.10#820010)