[ 
https://issues.apache.org/jira/browse/HDFS-16550?focusedWorklogId=759837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-759837
 ]

ASF GitHub Bot logged work on HDFS-16550:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 21/Apr/22 08:39
            Start Date: 21/Apr/22 08:39
    Worklog Time Spent: 10m 
      Work Description: hadoop-yetus commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1104880896

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  13m 10s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  39m 32s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 43s |  |  trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 42s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 23s |  |  trunk passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 49s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 45s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m 24s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  the patch passed with JDK Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 22s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 257m 47s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 16s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 381m  4s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4209 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 47901c544c06 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 1b57503a71f692a136ff0a1db219fcdcdf1c1fda |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.14.1+1-Ubuntu-0ubuntu1.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/testReport/ |
   | Max. process+thread count | 3207 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4209/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 759837)
    Time Spent: 20m  (was: 10m)

> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
>                 Key: HDFS-16550
>                 URL: https://issues.apache.org/jira/browse/HDFS-16550
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: tomscut
>            Assignee: tomscut
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we ran into the following situation while 
> upgrading the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the JournalNodes. {color:#ff0000}(related config: 
> dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster runs for a while.
> 3. {color:#ff0000}Active namenode (nn0){color} shut down because of 
> “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff0000}New Active namenode (nn1){color} also shut down because of 
> “{_}Timed out waiting 120000ms for a quorum of nodes to respond{_}”.
> 6. {color:#ff0000}The cluster crashed{color}.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *dfs.journalnode.edit-cache-size.bytes* can be set to a value 
> larger than the maximum heap available to the process. If 
> {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * 
> Runtime.getRuntime().maxMemory(){*}, only a warning is logged during 
> JournalNode startup, which users can easily overlook. After the cluster has 
> been running for a while, however, this is likely to cause the cluster to 
> crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
>  
> IMO, when {*}dfs.journalnode.edit-cache-size.bytes > threshold * 
> Runtime.getRuntime().maxMemory(){*}, we should throw an exception and 
> {color:#ff0000}fail fast{color}, giving users a clear hint to update the 
> related configuration.
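
The fail-fast idea proposed above could look roughly like the sketch below. It is only an illustration, not the change made in PR #4209: the class name `EditCacheSizeCheck`, the method `validateCapacity`, and the threshold value `0.5` are all hypothetical, and the real fix may pick a different threshold or exception type.

```java
/**
 * Illustrative sketch only; not the change made in PR #4209.
 * The class name, method name and threshold value are hypothetical.
 */
public final class EditCacheSizeCheck {

  /** Hypothetical fraction of the max heap that the cache may occupy. */
  static final double MAX_HEAP_FRACTION = 0.5;

  /**
   * Returns the capacity unchanged if it fits within the allowed fraction
   * of the heap; otherwise throws instead of only logging a warning.
   */
  static long validateCapacity(long capacityBytes, long maxHeapBytes,
      double maxHeapFraction) {
    if (capacityBytes > maxHeapFraction * maxHeapBytes) {
      throw new IllegalArgumentException(String.format(
          "Cache capacity is set at %d bytes but maximum JVM memory is only "
              + "%d bytes (allowed fraction: %.2f). Decrease the cache size "
              + "or increase the heap size.",
          capacityBytes, maxHeapBytes, maxHeapFraction));
    }
    return capacityBytes;
  }

  public static void main(String[] args) {
    long maxHeap = Runtime.getRuntime().maxMemory();
    long requested = 1024L * 1024L * 1024L; // 1 GiB, as in the report above
    try {
      validateCapacity(requested, maxHeap, MAX_HEAP_FRACTION);
      System.out.println("Cache capacity accepted: " + requested);
    } catch (IllegalArgumentException e) {
      // A JournalNode would abort startup here rather than continue.
      System.err.println("Refusing to start: " + e.getMessage());
    }
  }
}
```

With the configuration from the report (1G cache, -Xmx1G), such a check would reject the setting at startup, because the requested capacity exceeds any fraction below 1.0 of the heap.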



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

