[ https://issues.apache.org/jira/browse/HDFS-13977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909551#comment-16909551 ]
Hadoop QA commented on HDFS-13977:
----------------------------------

-1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 52s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 2 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 19m 54s | trunk passed |
| +1 | compile | 1m 4s | trunk passed |
| +1 | checkstyle | 0m 47s | trunk passed |
| +1 | mvnsite | 1m 5s | trunk passed |
| +1 | shadedclient | 13m 27s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 14s | trunk passed |
| +1 | javadoc | 0m 58s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 1s | the patch passed |
| +1 | compile | 1m 7s | the patch passed |
| +1 | javac | 1m 7s | the patch passed |
| -0 | checkstyle | 0m 51s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 171 unchanged - 0 fixed = 172 total (was 171) |
| +1 | mvnsite | 1m 15s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 34s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 22s | the patch passed |
| +1 | javadoc | 0m 56s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 105m 49s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
| | | 167m 46s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock |
| | hadoop.hdfs.tools.TestDFSZKFailoverController |
| | hadoop.metrics2.sink.TestRollingFileSystemSinkWithHdfs |
| | hadoop.cli.TestHDFSCLI |
| | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap |
| | hadoop.hdfs.server.namenode.ha.TestHAAppend |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.1 Server=19.03.1 Image:yetus/hadoop:bdbca0e53b4 |
| JIRA Issue | HDFS-13977 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12977834/HDFS-13977.000.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 7cd090bb7fa7 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a46ba03 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/27535/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/27535/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/27535/testReport/ |
| Max. process+thread count | 3133 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/27535/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> NameNode can kill itself if it tries to send too many txns to a QJM simultaneously
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13977
>                 URL: https://issues.apache.org/jira/browse/HDFS-13977
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode, qjm
>    Affects Versions: 2.7.7
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-13977.000.patch
>
> h3. Problem & Logs
> We recently encountered an issue on a large cluster (running 2.7.4) in which the NameNode killed itself because it was unable to communicate with the JNs via QJM. We discovered that it was the result of the NameNode trying to send a huge batch of over 1 million transactions to the JNs in a single RPC:
> {code:title=NameNode Logs}
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal X.X.X.X:XXXX failed to write txns 10000000-11153636. Will try to write to this JN again after the next log roll.
> ...
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 1098ms to send a batch of 1153637 edits (335886611 bytes) to remote journal X.X.X.X:XXXX
> {code}
> {code:title=JournalNode Logs}
> INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: readAndProcess from client X.X.X.X threw exception [java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864. RPC came from X.X.X.X]
> java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864.
> RPC came from X.X.X.X
>         at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:897)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:753)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:724)
> {code}
> The JournalNodes rejected the RPC because its size was well over the 64MB default {{ipc.maximum.data.length}}.
> This was triggered by a huge number of files all hitting a hard lease timeout simultaneously, causing the NN to force-close them all at once. This can be a particularly nasty bug because the NN will attempt to re-send the same huge RPC on restart, since it loads an fsimage that still contains all of these open files needing to be force-closed.
> h3. Proposed Solution
> To solve this, we propose to modify {{EditsDoubleBuffer}} to add a "hard limit" based on the value of {{ipc.maximum.data.length}}. When {{writeOp()}} or {{writeRaw()}} is called, first check the size of {{bufCurrent}}. If it exceeds the hard limit, block the writer until the buffer is flipped and {{bufCurrent}} becomes {{bufReady}}. This provides some self-throttling to prevent the NameNode from killing itself in this way.

This message was sent by Atlassian JIRA
(v7.6.14#76016)
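The blocking behavior described in the proposed solution could be sketched roughly as below. This is a hypothetical illustration only, not the actual HDFS-13977 patch: the class name {{BoundedDoubleBuffer}} and its methods are simplified stand-ins for Hadoop's real {{EditsDoubleBuffer}} API, and real code would flush {{bufReady}} to the JNs between flips.

{code:title=Hypothetical sketch of a size-bounded double buffer}
import java.io.ByteArrayOutputStream;

// Hypothetical sketch: a double buffer whose writer blocks once the current
// buffer would exceed a hard limit, so no single flush batch can grow past
// what ipc.maximum.data.length allows.
class BoundedDoubleBuffer {
    private final int hardLimit;  // e.g. derived from ipc.maximum.data.length
    private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
    private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

    BoundedDoubleBuffer(int hardLimit) {
        this.hardLimit = hardLimit;
    }

    // Block the writer while appending would push bufCurrent over the hard
    // limit; the flushing thread unblocks it by flipping the buffers.
    synchronized void writeRaw(byte[] data) throws InterruptedException {
        while (bufCurrent.size() + data.length > hardLimit) {
            wait();  // woken by setReadyToFlush() when buffers flip
        }
        bufCurrent.write(data, 0, data.length);
    }

    // Flip: current becomes ready-to-flush, the writer gets a fresh buffer,
    // and any writer blocked on the hard limit is woken up.
    synchronized byte[] setReadyToFlush() {
        ByteArrayOutputStream tmp = bufReady;
        bufReady = bufCurrent;
        bufCurrent = tmp;
        bufCurrent.reset();
        notifyAll();
        return bufReady.toByteArray();  // batch to send to the JNs
    }

    synchronized int currentSize() {
        return bufCurrent.size();
    }
}
{code}

With a hard limit tied to the RPC size cap, a writer that would otherwise accumulate a 335MB batch instead parks until the flush thread drains the buffer, bounding every batch sent to the JournalNodes.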