[jira] [Comment Edited] (HDFS-13977) NameNode can kill itself if it tries to send too many txns to a QJM simultaneously

Erik Krogen (Jira) Tue, 20 Aug 2019 08:23:54 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911452#comment-16911452
 ]


Erik Krogen edited comment on HDFS-13977 at 8/20/19 3:22 PM:
-------------------------------------------------------------

Thanks for taking a look [~shv] and [~jojochuang]!

{quote}
I understand this becomes a problem when people stop using 
EditLogFileOutputStream along with QuorumOutputStream. When they use both 
FileOutputStream would trigger the sync, which is performed uniformly for all 
output streams both File and Quorum. But if one uses QuorumOutputStream only, 
then sync is never triggered as you discovered.
{quote}
Yes, great point. This could help to explain why this issue hasn't been more 
widespread.

{quote}
Could you please give more details about the test. I understand the mocked 
spyLoggers counts number of calls to sendEdits(). With your fix sendEdits() is 
called only once on each of the three streams. Why is it called 5 times when 
sync is never enforced (shouldForceSync() = false)?
{quote}
With my fix reverted from {{QuorumOutputStream}}, {{sendEdits()}} is never 
called. There are, however, 5 _total interactions_ with each mock, which you 
may be confusing with interactions to {{sendEdits()}}. See the test failure 
output with the fix reverted:
{code}
Wanted but not invoked:
asyncLogger.sendEdits(
    1L,
    1L,
    <any integer>,
    <any>
);
-> at 
org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManagerUnit.testFSEditLogAutoSyncToQuorumStream(TestQuorumJournalManagerUnit.java:255)

However, there were exactly 5 interactions with this mock:
asyncLogger.getJournalState();
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournalState(AsyncLoggerSet.java:212)

asyncLogger.newEpoch(1L);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.newEpoch(AsyncLoggerSet.java:231)

asyncLogger.setEpoch(1L);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.setEpoch(AsyncLoggerSet.java:66)

asyncLogger.startLogSegment(1L, -65);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.startLogSegment(AsyncLoggerSet.java:240)

asyncLogger.startLogSegment(1L, -65);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.startLogSegment(AsyncLoggerSet.java:240)
{code}
There were 0 calls to {{sendEdits()}}, but 5 other calls on the mock.

Edit: Also attached v001 patch fixing the checkstyle issue. I just removed the 
locally created {{conf}} and used the one from the test setup.


was (Author: xkrogen):
Thanks for taking a look [~shv] and [~jojochuang]!

{quote}
I understand this becomes a problem when people stop using 
EditLogFileOutputStream along with QuorumOutputStream. When they use both 
FileOutputStream would trigger the sync, which is performed uniformly for all 
output streams both File and Quorum. But if one uses QuorumOutputStream only, 
then sync is never triggered as you discovered.
{quote}
Yes, great point. This could help to explain why this issue hasn't been more 
widespread.

{quote}
Could you please give more details about the test. I understand the mocked 
spyLoggers counts number of calls to sendEdits(). With your fix sendEdits() is 
called only once on each of the three streams. Why is it called 5 times when 
sync is never enforced (shouldForceSync() = false)?
{quote}
With my fix reverted from {{QuorumOutputStream}}, {{sendEdits()}} is never 
called. There are, however, 5 _total interactions_ with each mock, which you 
may be confusing with interactions to {{sendEdits()}}. See the test failure 
output with the fix reverted:
{code}
Wanted but not invoked:
asyncLogger.sendEdits(
    1L,
    1L,
    <any integer>,
    <any>
);
-> at 
org.apache.hadoop.hdfs.qjournal.client.TestQuorumJournalManagerUnit.testFSEditLogAutoSyncToQuorumStream(TestQuorumJournalManagerUnit.java:255)

However, there were exactly 5 interactions with this mock:
asyncLogger.getJournalState();
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.getJournalState(AsyncLoggerSet.java:212)

asyncLogger.newEpoch(1L);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.newEpoch(AsyncLoggerSet.java:231)

asyncLogger.setEpoch(1L);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.setEpoch(AsyncLoggerSet.java:66)

asyncLogger.startLogSegment(1L, -65);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.startLogSegment(AsyncLoggerSet.java:240)

asyncLogger.startLogSegment(1L, -65);
-> at 
org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.startLogSegment(AsyncLoggerSet.java:240)
{code}
There were 0 calls to {{sendEdits()}}, but 5 other calls on the mock.

> NameNode can kill itself if it tries to send too many txns to a QJM 
> simultaneously
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13977
>                 URL: https://issues.apache.org/jira/browse/HDFS-13977
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode, qjm
>    Affects Versions: 2.7.7
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-13977.000.patch, HDFS-13977.001.patch
>
>
> h3. Problem & Logs
> We recently encountered an issue on a large cluster (running 2.7.4) in which 
> the NameNode killed itself because it was unable to communicate with the JNs 
> via QJM. We discovered that it was the result of the NameNode trying to send 
> a huge batch of over 1 million transactions to the JNs in a single RPC:
> {code:title=NameNode Logs}
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote 
> journal X.X.X.X:XXXX failed to
>  write txns 10000000-11153636. Will try to write to this JN again after the 
> next log roll.
> ...
> WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 1098ms 
> to send a batch of 1153637 edits (335886611 bytes) to remote journal 
> X.X.X.X:XXXX
> {code}
> {code:title=JournalNode Logs}
> INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8485: 
> readAndProcess from client X.X.X.X threw exception [java.io.IOException: 
> Requested data length 335886776 is longer than maximum configured RPC length 
> 67108864.  RPC came from X.X.X.X]
> java.io.IOException: Requested data length 335886776 is longer than maximum 
> configured RPC length 67108864.  RPC came from X.X.X.X
>         at 
> org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
>         at 
> org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
>         at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:897)
>         at 
> org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:753)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:724)
> {code}
> The JournalNodes rejected the RPC because it had a size well over the 64MB 
> default {{ipc.maximum.data.length}}.
> This was triggered by a huge number of files all hitting a hard lease timeout 
> simultaneously, causing the NN to force-close them all at once. This can be a 
> particularly nasty bug as the NN will attempt to re-send this same huge RPC 
> on restart, as it loads an fsimage which still has all of these open files 
> that need to be force-closed.
> h3. Proposed Solution
> To solve this we propose to modify {{EditsDoubleBuffer}} to add a "hard 
> limit" based on the value of {{ipc.maximum.data.length}}. When {{writeOp()}} 
> or {{writeRaw()}} is called, first check the size of {{bufCurrent}}. If it 
> exceeds the hard limit, block the writer until the buffer is flipped and 
> {{bufCurrent}} becomes {{bufReady}}. This gives some self-throttling to 
> prevent the NameNode from killing itself in this way.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (HDFS-13977) NameNode can kill itself if it tries to send too many txns to a QJM simultaneously

Reply via email to