[
https://issues.apache.org/jira/browse/HDFS-16550?focusedWorklogId=759837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759837
]
ASF GitHub Bot logged work on HDFS-16550:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Apr/22 08:39
Start Date: 21/Apr/22 08:39
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1104880896
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 13m 10s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 32s | | trunk passed |
| +1 :green_heart: | compile | 1m 43s | | trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | compile | 1m 37s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 42s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 23s | | trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 1m 49s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 45s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 24s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 21s | | the patch passed |
| +1 :green_heart: | compile | 1m 25s | | the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javac | 1m 25s | | the patch passed |
| +1 :green_heart: | compile | 1m 19s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 19s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 1s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 26s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 21s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 22s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 257m 47s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 16s | | The patch does not generate ASF License warnings. |
| | | 381m 4s | | |
| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
| uname | Linux 47901c544c06 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 1b57503a71f692a136ff0a1db219fcdcdf1c1fda |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/testReport/ |
| Max. process+thread count | 3207 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
This message was automatically generated.
Issue Time Tracking
-------------------
Worklog Id: (was: 759837)
Time Spent: 20m (was: 10m)
> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: tomscut
> Assignee: tomscut
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png,
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> After we introduced {*}SBN Read{*}, we ran into the following situation while
> upgrading the JournalNodes.
> Cluster Info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling-restart the JournalNodes. {color:#ff0000}(related config:
> dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster runs for a while.
> 3. The {color:#ff0000}Active NameNode (nn0){color} shuts down with
> {_}“Timed out waiting 120000ms for a quorum of nodes to respond”{_}.
> 4. Transition nn1 to the Active state.
> 5. The {color:#ff0000}new Active NameNode (nn1){color} also shuts down with
> the same {_}“Timed out waiting 120000ms for a quorum of nodes to respond”{_} error.
> 6. {color:#ff0000}The cluster crashes{color}.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   // An oversized capacity only triggers a warning; startup still continues.
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   // Fair read/write lock guarding the cache contents.
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set larger than the
> heap available to the process. If
> {*}dfs.journalnode.edit-cache-size.bytes > 0.9 *
> Runtime.getRuntime().maxMemory(){*}, only a warning is logged during
> JournalNode startup, which users can easily overlook. However, once the
> cluster has run for some time and the cache fills up, it can exhaust the
> JournalNode heap, which is likely to bring the whole cluster down.
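A minimal sketch of the quoted warn-only condition, with the heap limit passed in as a parameter so the reported setup can be evaluated without starting a 1 GB JVM. The class and method names (`EditCacheWarnCheck`, `exceedsWarnThreshold`) are hypothetical and not Hadoop APIs. With a 1 GiB cache on a 1 GiB heap the condition is true, yet the only consequence is a log line. Depending on the garbage collector, `Runtime.maxMemory()` may also report somewhat less than the `-Xmx` value, which only shrinks the remaining headroom further.

```java
// Hypothetical, self-contained sketch of the warn-only check; not Hadoop code.
final class EditCacheWarnCheck {
    // Same factor as the quoted constructor: capacity > 0.9 * maxMemory.
    static final double WARN_FRACTION = 0.9;

    // maxMemoryBytes is a parameter so any heap size can be checked,
    // not just the heap of the JVM running this sketch.
    static boolean exceedsWarnThreshold(long capacityBytes, long maxMemoryBytes) {
        return capacityBytes > WARN_FRACTION * maxMemoryBytes;
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        // Reported setup: 1 GiB edit cache on a -Xmx1G heap.
        if (exceedsWarnThreshold(oneGiB, oneGiB)) {
            // Like the quoted constructor, only a warning is emitted; nothing stops startup.
            System.err.printf("WARN: cache capacity %d bytes exceeds %.0f%% of max heap %d bytes%n",
                    oneGiB, WARN_FRACTION * 100, oneGiB);
        }
        // The value the real check compares against, for this JVM.
        System.out.println("This JVM's maxMemory: " + Runtime.getRuntime().maxMemory());
    }
}
```

A 900 MiB cache on the same 1 GiB heap stays below the 0.9 factor and would not even produce the warning, although it still leaves little room for everything else the JournalNode keeps on the heap.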
>
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
>
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold *
> Runtime.getRuntime().maxMemory(){*}, we should throw an exception and
> {color:#ff0000}fail fast{color}, giving users a clear hint to update the
> related configuration.
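The proposed fail-fast behavior could look roughly like the sketch below. It assumes a hypothetical `EditCacheCapacityValidator.validate(...)` helper that throws `IllegalArgumentException` instead of logging; `threshold` stands in for the unspecified threshold in the proposal, and 0.9 is used in the example only because it is the factor in the current check. This is an illustration of the idea, not the change made in PR #4209.

```java
// Hypothetical sketch of a fail-fast capacity check; names are illustrative, not Hadoop APIs.
final class EditCacheCapacityValidator {

    // Throws instead of warning when the cache would exceed `threshold` of the heap.
    static void validate(long capacityBytes, long maxMemoryBytes, double threshold) {
        if (threshold <= 0 || threshold > 1) {
            throw new IllegalArgumentException("threshold must be in (0, 1], got " + threshold);
        }
        if (capacityBytes > threshold * maxMemoryBytes) {
            throw new IllegalArgumentException(String.format(
                    "Cache capacity is set at %d bytes but maximum JVM memory is only %d bytes "
                            + "(threshold %.2f). Decrease the cache size or increase the heap size.",
                    capacityBytes, maxMemoryBytes, threshold));
        }
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        validate(oneGiB / 4, oneGiB, 0.9); // passes: cache is a quarter of the heap
        try {
            validate(oneGiB, oneGiB, 0.9); // fails fast: cache as large as the whole heap
        } catch (IllegalArgumentException e) {
            System.out.println("Startup aborted: " + e.getMessage());
        }
    }
}
```

Throwing from the constructor would stop the JournalNode at startup, so a misconfiguration surfaces during the rolling restart itself rather than later as quorum timeouts on the NameNodes.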
--
This message was sent by Atlassian Jira
(v8.20.7#820007)