[ https://issues.apache.org/jira/browse/HDFS-13977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16909312#comment-16909312 ]

Erik Krogen commented on HDFS-13977:
------------------------------------

Digging into this, I was surprised to learn that {{FSEditLog}} already has 
logic to force new writers to wait if an auto-sync is needed (i.e., the 
{{EditsDoubleBuffer}} is full):
{code:title=FSEditLog}
  void logEdit(final FSEditLogOp op) {
    boolean needsSync = false;
    synchronized (this) {
      assert isOpenForWrite() :
        "bad state: " + state;
      
      // wait if an automatic sync is scheduled
      waitIfAutoSyncScheduled();

      // check if it is time to schedule an automatic sync
      needsSync = doEditTransaction(op);
      if (needsSync) {
        isAutoSyncScheduled = true;
      }
    }

    // Sync the log if an automatic sync is required.
    if (needsSync) {
      logSync();
    }
  }
{code}
So why didn't it protect us against the issue described in this JIRA? The 
return value of {{doEditTransaction}} is based on {{shouldForceSync()}}:
{code:title=doEditTransaction}
    return shouldForceSync();
{code}
which in turn calls {{editLogStream.shouldForceSync()}}. However, 
{{QuorumOutputStream}} _does not override this method_, and instead inherits 
the default defined in {{EditLogOutputStream}}:
{code:title=EditLogOutputStream}
  public boolean shouldForceSync() {
    return false;
  }
{code}
Thus it would never force a sync. Overriding this method fixes the issue.
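
For reference, a minimal sketch of what the override could look like, mirroring 
how {{EditLogFileOutputStream}} delegates to its double buffer (the {{buf}} 
field name is taken from the existing {{QuorumOutputStream}} code, and the 
actual patch may differ):
{code:title=QuorumOutputStream (sketch)}
  @Override
  public boolean shouldForceSync() {
    // Delegate to the EditsDoubleBuffer, which reports true once bufCurrent
    // has grown past its configured buffer size.
    return buf.shouldForceSync();
  }
{code}
With an override like this, {{logEdit()}} sees {{needsSync == true}} once the 
double buffer fills up and schedules an auto-sync, just as it already does for 
file-based streams.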

Testing this was a bit tricky, but I managed to write a test that passes with 
the fix and fails without it.

[~shv], could you help review?

> NameNode can kill itself if it tries to send too many txns to a QJM 
> simultaneously
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13977
>                 URL: https://issues.apache.org/jira/browse/HDFS-13977
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode, qjm
>    Affects Versions: 2.7.7
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-13977.000.patch
>
>
> h3. Problem & Logs
> We recently encountered an issue on a large cluster (running 2.7.4) in which 
> the NameNode killed itself because it was unable to communicate with the JNs 
> via QJM. We discovered that it was the result of the NameNode trying to send 
> a huge batch of over 1 million transactions to the JNs in a single RPC:
> {code:title=NameNode Logs}
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal X.X.X.X:XXXX failed to write txns 10000000-11153636. Will try to write to this JN again after the next log roll.
> ...
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 1098ms to send a batch of 1153637 edits (335886611 bytes) to remote journal X.X.X.X:XXXX
> {code}
> {code:title=JournalNode Logs}
> INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: readAndProcess from client X.X.X.X threw exception [java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864.  RPC came from X.X.X.X]
> java.io.IOException: Requested data length 335886776 is longer than maximum configured RPC length 67108864.  RPC came from X.X.X.X
>         at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
>         at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:897)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:753)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:724)
> {code}
> The JournalNodes rejected the RPC because it had a size well over the 64MB 
> default {{ipc.maximum.data.length}}.
> This was triggered by a huge number of files all hitting a hard lease timeout 
> simultaneously, causing the NN to force-close them all at once. This can be a 
> particularly nasty bug because the NN will attempt to re-send the same huge RPC 
> on restart, since it loads an fsimage which still has all of these open files 
> that need to be force-closed.
> h3. Proposed Solution
> To solve this we propose to modify {{EditsDoubleBuffer}} to add a "hard 
> limit" based on the value of {{ipc.maximum.data.length}}. When {{writeOp()}} 
> or {{writeRaw()}} is called, first check the size of {{bufCurrent}}. If it 
> exceeds the hard limit, block the writer until the buffer is flipped and 
> {{bufCurrent}} becomes {{bufReady}}. This gives some self-throttling to 
> prevent the NameNode from killing itself in this way.
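
To make the hard-limit idea in the proposed solution concrete, here is an 
illustrative, self-contained sketch of a double buffer whose writers block once 
the current buffer crosses a hard byte limit and are released when the buffers 
are flipped. The class and method names below are hypothetical; the real change 
would live in {{EditsDoubleBuffer}} and derive the limit from 
{{ipc.maximum.data.length}}:
{code:title=Hard-limit double buffer (illustrative sketch, not the actual patch)}
import java.io.ByteArrayOutputStream;

/**
 * Hypothetical model of the proposed throttling: write() blocks while the
 * current buffer is over the hard limit, and the buffer flip in
 * setReadyToFlush() wakes the blocked writers.
 */
class BoundedDoubleBuffer {
  private final int hardLimitBytes; // would be derived from ipc.maximum.data.length
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

  BoundedDoubleBuffer(int hardLimitBytes) {
    this.hardLimitBytes = hardLimitBytes;
  }

  /** Appends a serialized edit, blocking while bufCurrent exceeds the hard limit. */
  synchronized void write(byte[] record) throws InterruptedException {
    while (bufCurrent.size() >= hardLimitBytes) {
      wait(); // released by setReadyToFlush()
    }
    bufCurrent.write(record, 0, record.length);
  }

  /** Flips the buffers and wakes blocked writers; assumes bufReady was already drained. */
  synchronized void setReadyToFlush() {
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
    bufCurrent.reset();
    notifyAll();
  }

  /** Returns the bytes that are ready to be flushed to the JournalNodes. */
  synchronized byte[] getReadyBytes() {
    byte[] data = bufReady.toByteArray();
    bufReady.reset();
    return data;
  }
}
{code}
The point of the blocking write is that the throttling happens in the 
log-writing threads before a flush, so a flush can never accumulate an RPC 
larger than the JournalNodes are configured to accept.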


