[
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Cao updated HDFS-14646:
------------------------------
Summary: Standby NameNode should not upload fsimage to an inappropriate
NameNode. (was: Standby NameNode should terminate the FsImage put process
immediately if the peer NN is not in the appropriate state to receive an image.)
> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> ------------------------------------------------------------------------
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.1.2
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Major
> Attachments: HDFS-14646.000.patch, blockedInWritingSocket.png,
> get1.png, get2.png, largeSendQ.png
>
>
> *Problem Description:*
> In the multi-NameNode scenario, when an SNN uploads an FsImage, it puts the
> image to all other NNs, whether or not the peer NN is the ANN. Even if the
> peer NN immediately replies with an error (such as
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE or
> TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate
> the put process. It sends the entire FsImage to the peer NN and does not
> read the peer NN's reply until the put is complete.
> In a relatively large HDFS cluster, the FsImage can often reach about
> 30GB. In this case, the invalid put causes two problems:
> # It wastes time and bandwidth.
> # Because the ImageServlet of the peer NN has stopped receiving the FsImage,
> the socket Send-Q of the local SNN grows very large, and the ImageUpload
> thread blocks on socket writes for a long time. As a result, the local
> StandbyCheckpointer thread is often blocked for several hours.
> *An example is as follows:*
> In the following figures, the local NN 100.76.3.234 is an SNN, the peer NN
> 100.76.3.170 is another SNN, and 8080 is the NN HTTP port. When the local SNN
> starts to put the FsImage, 170 immediately replies with a
> NOT_ACTIVE_NAMENODE_FAILURE error. The local SNN should terminate the put
> at that point, but in fact it has to wait until the image has been
> completely put to the peer NN before it can read the response.
> # At this time, because the ImageServlet of the peer NN has stopped receiving
> the FsImage, the socket Send-Q of the local SNN is very large:
> !largeSendQ.png!
> # Moreover, the local SNN's ImageUpload thread is blocked on socket writes
> for a long time:
> !blockedInWritingSocket.png!
>
> # Eventually, the StandbyCheckpointer thread of the local SNN waits
> for the result of the ImageUpload thread, blocking in Future.get(),
> and may stay blocked for several hours:
> !get1.png!
>
> !get2.png!
>
>
> *Solution:*
> When the local SNN plans to put an FsImage to the peer NN, it needs to check
> whether the put is actually needed at this time. The check proceeds as follows:
> # Establish an HTTP connection with the peer NN, send the put request, and
> then immediately read the response (this is the key point). If the peer NN
> replies with any of the following errors (TransferResult.AUTHENTICATION_FAILURE,
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE,
> TransferResult.OLD_TRANSACTION_ID_FAILURE), immediately terminate the put
> process.
> # If the peer NN is indeed the Active NameNode and is in the
> appropriate state to receive an image, it replies with HTTP response 410
> (HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE). At
> this point, the local SNN can actually begin to put the image.
> *Note:*
> This problem needs a large cluster to reproduce (the FsImage in our cluster
> is about 30GB), so a unit test is difficult to write.
> In our cluster, the modification has solved the problem, and the large
> Send-Q backlog no longer occurs.
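
The probe-then-upload decision described under *Solution* can be sketched
roughly as below. This is only an illustration of the decision rule, not the
code in HDFS-14646.000.patch: the class UploadProbeSketch, the enum
ProbeResult (a stand-in for the TransferResult values named above), the
method names, and the peer names are all hypothetical, and the HTTP exchange
itself is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UploadProbeSketch {
    /** Hypothetical stand-in for the TransferResult values named in the issue. */
    enum ProbeResult {
        AUTHENTICATION_FAILURE,
        NOT_ACTIVE_NAMENODE_FAILURE,
        OLD_TRANSACTION_ID_FAILURE,
        UNEXPECTED_FAILURE
    }

    /**
     * Per the issue text, a peer that is active and ready for an image answers
     * the probe with UNEXPECTED_FAILURE (HTTP 410). Any of the other three
     * replies means the put is abandoned before image bytes are sent.
     */
    static boolean shouldUpload(ProbeResult reply) {
        return reply == ProbeResult.UNEXPECTED_FAILURE;
    }

    /** Keeps only the peers whose probe reply allows the upload. */
    static List<String> selectUploadTargets(Map<String, ProbeResult> replies) {
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, ProbeResult> e : replies.entrySet()) {
            if (shouldUpload(e.getValue())) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Peer names are made up for the example.
        Map<String, ProbeResult> replies = new LinkedHashMap<>();
        replies.put("nn-standby-2", ProbeResult.NOT_ACTIVE_NAMENODE_FAILURE);
        replies.put("nn-active", ProbeResult.UNEXPECTED_FAILURE);
        System.out.println(selectUploadTargets(replies)); // prints [nn-active]
    }
}
```

In this sketch the standby peer is filtered out after the probe, so the
ImageUpload thread would never start writing the image to it.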