[
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17622609#comment-17622609
]
ASF GitHub Bot commented on HDFS-16550:
---------------------------------------
hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1287763426
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 35s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 51s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 34s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 16s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 34s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 43s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 35s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 4s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 20s | | the patch passed |
| +1 :green_heart: | compile | 1m 24s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 59s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 28s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 18s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 46s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 244m 35s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 1m 4s | | The patch does not generate ASF License warnings. |
| | | 353m 46s | | |
| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestObserverNode |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint markdownlint |
| uname | Linux 779ee7881403 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 5e87c99b7d2ad717f64a2d7180d9e736063d0739 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/testReport/ |
| Max. process+thread count | 2948 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/4/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Tao Li
> Assignee: Tao Li
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png,
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
> Time Spent: 1h
> Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a problem while upgrading
> the JournalNodes.
> Cluster Info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the journal nodes. {color:#ff0000}(related config:
> dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster runs for a while; edits cache usage keeps increasing until
> memory is used up.
> 3. {color:#ff0000}The active namenode (nn0){color} shut down because of
> “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff0000}The new active namenode (nn1){color} also shut down because of
> “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.
> 6. {color:#ff0000}The cluster crashed{color}.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> }
> {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the
> memory available to the process. If
> {*}dfs.journalnode.edit-cache-size.bytes > 0.9 *
> Runtime.getRuntime().maxMemory(){*}, only a warning is logged during
> journalnode startup, which is easy for users to overlook. However, after the
> cluster has been running for some time, it is likely to crash.
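The warn-only behavior described above is easy to reproduce in isolation. A minimal sketch, using plain Java with illustrative constants standing in for the real Configuration and Runtime values (this is not the actual JournalNode code):

```java
// Illustrative reproduction of the warn-only check in JournaledEditsCache.
// The constants mirror the cluster config reported in this issue (assumed):
// a 1 GiB edits cache configured inside a 1 GiB heap.
public class WarnOnlyCheck {
  public static void main(String[] args) {
    long capacity = 1L << 30;   // dfs.journalnode.edit-cache-size.bytes = 1G
    long maxMemory = 1L << 30;  // -Xmx1G, so maxMemory() is roughly 1 GiB

    if (capacity > 0.9 * maxMemory) {
      // Only a warning is emitted; nothing stops the daemon here.
      System.out.println("WARN: cache capacity " + capacity
          + " bytes exceeds 90% of max JVM memory " + maxMemory + " bytes");
    }
    // Startup continues, and the JVM later exhausts its heap once the
    // edits cache fills up, which is the crash reported in this issue.
    System.out.println("JournalNode startup continues anyway");
  }
}
```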
>
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold *
> Runtime.getRuntime().maxMemory(){*}, we should throw an exception and
> {color:#ff0000}fail fast{color}, giving users a clear hint to update the
> related configuration. Alternatively, if the cache size exceeds 50% (or some
> other threshold) of maxMemory, force it down to 25% of maxMemory.
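Both proposals above can be sketched as a small validation helper. This is a hypothetical illustration, not the actual patch: the class and method names are invented, and the 0.9 threshold simply mirrors the existing warning:

```java
// Hypothetical sketch of the two options proposed in this issue:
// (1) fail fast when the configured cache cannot fit in the heap, or
// (2) silently clamp an oversized cache to a safe fraction of the heap.
public final class CacheCapacityCheck {
  static final double FAIL_THRESHOLD = 0.9; // mirrors the existing warn check

  // Option 1: throw at startup instead of only logging a warning.
  static long validateCacheCapacity(long configured, long maxMemory) {
    if (configured > FAIL_THRESHOLD * maxMemory) {
      throw new IllegalArgumentException(String.format(
          "Cache capacity %d bytes exceeds %.0f%% of max JVM memory (%d bytes); "
              + "decrease dfs.journalnode.edit-cache-size.bytes or increase the heap.",
          configured, FAIL_THRESHOLD * 100, maxMemory));
    }
    return configured;
  }

  // Option 2: if the cache exceeds 50% of the heap, fall back to 25% of it.
  static long clampCacheCapacity(long configured, long maxMemory) {
    return configured > maxMemory / 2 ? maxMemory / 4 : configured;
  }

  public static void main(String[] args) {
    long maxMem = 1L << 30;                          // pretend -Xmx1G
    validateCacheCapacity(100L << 20, maxMem);       // 100 MiB cache: accepted
    try {
      validateCacheCapacity(maxMem, maxMem);         // 1 GiB cache in 1 GiB heap
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    System.out.println("clamped to: " + clampCacheCapacity(maxMem, maxMem));
  }
}
```

Failing fast makes the misconfiguration visible at restart time, before the cache has had a chance to fill; clamping keeps the daemon up but hides the operator's intent, which is why the issue leans toward the exception.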
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]