[
https://issues.apache.org/jira/browse/HDFS-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760256#comment-17760256
]
ASF GitHub Bot commented on HDFS-17166:
---------------------------------------
hadoop-yetus commented on PR #5990:
URL: https://github.com/apache/hadoop/pull/5990#issuecomment-1698613947
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 28s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 31m 26s | | trunk passed |
| +1 :green_heart: | compile | 0m 32s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 30s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 25s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 33s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 35s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 26s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 0s | | trunk passed |
| -1 :x: | shadedclient | 22m 58s | | branch has errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 24s | | the patch passed |
| +1 :green_heart: | compile | 0m 25s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 25s | | the patch passed |
| +1 :green_heart: | compile | 0m 23s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 23s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 14s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 24s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 22s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 0m 20s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 0m 54s | | the patch passed |
| -1 :x: | shadedclient | 23m 21s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 19m 27s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 31s | | The patch does not generate ASF License warnings. |
| | | 108m 38s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5990/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5990 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c842325ffa1f 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / ff812317821414e9758f8d441e21aee826d75207 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5990/6/testReport/ |
| Max. process+thread count | 2567 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5990/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> RBF: Throwing NoNamenodesAvailableException for a long time, when failover
> --------------------------------------------------------------------------
>
> Key: HDFS-17166
> URL: https://issues.apache.org/jira/browse/HDFS-17166
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Jian Zhang
> Priority: Major
> Labels: pull-request-available
> Attachments: HDFS-17166.001.patch, HDFS-17166.002.patch,
> HDFS-17166.003.patch, HDFS-17166.004.patch, HDFS-17166.005.patch,
> image-2023-08-26-11-48-22-131.png, image-2023-08-26-11-56-50-181.png,
> image-2023-08-26-11-59-25-153.png, image-2023-08-26-12-01-39-968.png,
> image-2023-08-26-12-06-01-275.png, image-2023-08-26-12-07-47-010.png,
> image-2023-08-26-22-45-46-814.png, image-2023-08-26-22-47-22-276.png,
> image-2023-08-26-22-47-41-988.png, image-2023-08-26-22-48-02-086.png,
> image-2023-08-26-22-48-12-352.png
>
>
> When a nameservice (ns) fails over, the router may record that the ns has no
> active namenode, and it then cannot find the active nn in that ns for about
> one minute. The client reports an error after exhausting its retries, and the
> router remains unable to serve the ns for a long time.
> 11:52:44 Start reporting
> !image-2023-08-26-12-06-01-275.png|width=800,height=100!
> 11:53:46 end reporting
> !image-2023-08-26-12-07-47-010.png|width=800,height=20!
>
> At this point, the failover has already completed successfully in the ns, and
> the client can connect directly to the active namenode and access it
> successfully, but the client cannot access the ns through the router for up
> to a minute
>
> *There is a bug in this logic:*
> * A certain ns starts to fail over
> * While the ns has no active nn, the router reports that state (no active nn) to the state store
> * After a period of time, the router pulls the state store data to update its cache, and the cache records that the ns has no active nn
> * The failover completes successfully, at which point the ns actually has an active nn again
> * It is not yet time for the router to update its cache
> * A client sends a request for the ns to the router, and the router contacts the first nn of the ns in its cache (which records no active nn)
> * Unfortunately, that nn really is standby, so the request fails and enters the exception-handling logic. The router finds no active nn for the ns in its cache and throws NoNamenodesAvailableException
> * The NoNamenodesAvailableException is wrapped as a RetriableException, which causes the client to retry. Since every retry reaches the same real standby nn (it is always the first one in the cache and has the highest priority), a NoNamenodesAvailableException is thrown every time until the router updates its cache from the state store
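The retry loop described above can be sketched as a minimal, self-contained simulation. This is not the actual Hadoop RBF code; `retriesUntilSuccess`, `cache`, and `active` are illustrative names. Because the cached order never changes between retries, every retry contacts the same real standby nn until the retries run out:

```java
import java.util.List;
import java.util.Set;

// Minimal sketch of the buggy retry behavior (not the actual router code).
public class BuggyRetrySketch {
    /**
     * Simulates client retries against the router's cached nn order.
     * Returns the attempt number that succeeds, or -1 if every retry
     * fails because the first cached nn is a real standby.
     */
    static int retriesUntilSuccess(List<String> cache, Set<String> active,
                                   int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            // The router always tries the highest-priority (first) nn.
            String first = cache.get(0);
            if (active.contains(first)) {
                return attempt;
            }
            // nn is really standby: NoNamenodesAvailableException is thrown
            // and wrapped, the client retries; the cache order is unchanged,
            // so the next attempt hits the same nn.
        }
        return -1; // retries exhausted; the client reports an error
    }
}
```

With nn6002 (standby) first in the cache and nn6001 the real active nn, every retry fails; only a cache refresh from the state store would break the loop.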
>
> *How to reproduce*
> # Suppose we have a nameservice ns60 containing 2 nn: nn6001 is active and nn6002 is standby
> # If nn6001 and nn6002 are both in standby state, nn6002 has higher priority than nn6001
> # Use the default configuration
> # Shut down both nn's zkfc daemons, {*}hadoop-daemon.sh stop zkfc{*}, so failover must be performed manually
> # Manually switch nn6001 active->standby: *hdfs haadmin -ns ns60 -transitionToStandby --forcemanual nn6001*
> # Make sure that the NamenodeHeartbeatService reports that nn6001 is standby
> !image-2023-08-26-11-48-22-131.png|width=800,height=20!
> # Manually switch nn6001 standby->active: *hdfs haadmin -ns ns60 -transitionToActive --forcemanual nn6001*
> # The client accesses ns60 through router
> !image-2023-08-26-11-56-50-181.png|width=800,height=50!
> # After about one minute, request ns60 again through the router
> !image-2023-08-26-11-59-25-153.png|width=800,height=50!
> # Exceptions are reported for both requests; check the router log
> !image-2023-08-26-12-01-39-968.png|width=800,height=20!
> # The router cannot respond to the client's request for ns60 for a minute
>
>
> *Fix the bug*
> When the router's cache records no active nn for an ns while the ns actually
> has one, and a client request throws NoNamenodesAvailableException, that
> proves the nn just contacted is a real standby. Lower that nn's priority so
> that the next request finds the real active nn. This avoids repeatedly
> contacting the same real standby nn, which would otherwise leave the router
> unable to serve the ns to clients until the next cache update.
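The proposed behavior can be sketched as follows, under the assumption that the cached nn list is ordered by priority; `request`, `cache`, and `activeNns` are hypothetical names for illustration, not the actual patch code. On a failure that proves the first nn is a real standby, its priority is lowered by moving it to the back of the list, so the next attempt reaches the real active nn:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Simplified illustration of the fix (not the actual patch code).
public class FixedPrioritySketch {
    /**
     * Tries the highest-priority nn in the cached order. If it turns out
     * to be a real standby, deprioritizes it so the next attempt tries a
     * different nn instead of failing the same way until a cache refresh.
     */
    static String request(Deque<String> cache, Set<String> activeNns) {
        String first = cache.peekFirst();
        if (activeNns.contains(first)) {
            return "ok:" + first; // served by the real active nn
        }
        // Proven real standby: lower its priority (move to the back).
        cache.addLast(cache.pollFirst());
        return "retry"; // the client retries against the new order
    }
}
```

With `cache = [nn6002, nn6001]` and only nn6001 active, the first call deprioritizes nn6002 and the retry is served by nn6001, which matches the single NoNamenodesAvailableException observed in r1's log below.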
>
> *Test my patch*
> *1. Unit testing*
> *2. Comparison test*
> * Suppose we have 2 clients [c1 c2], 2 routers [r1 r2], and a nameservice [ns60]; the ns has 2 nn [nn6001 nn6002]
> * If both nn6001 and nn6002 are in standby state, nn6002 has higher priority than nn6001
> * r1 runs a build that fixes the bug; r2 runs the original build, which has the bug
> * c1 sends requests to r1 in a loop, and c2 sends requests to r2 in a loop; the requests target ns60
> * Put both nn6001 and nn6002 into standby state
> * After the router reports that both nn are in standby state, switch nn6001 to active
> *14:15:24* nn6001 is active
> !image-2023-08-26-22-45-46-814.png|width=800,height=120!
> * Check the log of router r1: after nn6001 switches to active, NoNamenodesAvailableException is printed only once
> !image-2023-08-26-22-47-22-276.png|width=800,height=30!
>
> * Check the log of router r2: NoNamenodesAvailableException is printed for more than one minute after nn6001 switches to active
> !image-2023-08-26-22-47-41-988.png|width=800,height=150!
>
> * At 14:16:25, client c2, accessing the router with the bug, could not get
> the data, while client c1, accessing the router with the fix, got the data
> normally:
> c2's log: unable to access normally
> !image-2023-08-26-22-48-02-086.png|width=800,height=50!
> c1's log: displays the result correctly
> !image-2023-08-26-22-48-12-352.png|width=800,height=150!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]