[
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xudong Cao updated HDFS-14646:
------------------------------
Attachment: largeSendQ.png
> Standby NameNode should terminate the FsImage put process as soon as possible
> if the peer NN is not in the appropriate state to receive an image.
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.1.2
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Major
> Attachments: get1.png, get2.png, largeSendQ.png
>
>
> *Problem Description:*
> In a multi-NameNode deployment, when a Standby NameNode (SNN) uploads an
> FsImage, it puts the image to every other NN, whether or not that peer is the
> Active NameNode (ANN). Even if the peer NN immediately replies with an error
> (such as TransferResult.NOT_ACTIVE_NAMENODE_FAILURE or
> TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate
> the put immediately. It sends the entire FsImage to the peer NN and does not
> read the peer NN's reply until the put has completed.
> In a relatively large HDFS cluster, the FsImage can often reach about
> 30 GB. In this case, the futile put causes two problems:
> # It wastes time and bandwidth.
> # Because the ImageServlet of the peer NN no longer receives the FsImage, the
> socket Send-Q of the local SNN grows very large, and the ImageUpload thread
> blocks on socket writes for a long time, which often leaves the local
> StandbyCheckpointer thread blocked for several hours.
> *An example is as follows:*
> In the following figure, the local NN 100.76.3.234 is an SNN, the peer NN
> 100.76.3.170 is another SNN, and 8080 is the NN HTTP port. When the local SNN
> starts to put the FsImage, 170 replies with a NOT_ACTIVE_NAMENODE_FAILURE
> error immediately. The local SNN should terminate the put at that point, but
> in fact it has to wait until the image has been completely put to the peer
> NN, and only then can it read the response.
> 1. At this time, because the ImageServlet of the peer NN no longer receives
> the FsImage, the socket Send-Q of the local SNN is very large:
>
> 2. Moreover, the local SNN's ImageUpload thread is blocked writing to the
> socket for a long time:
> !blockWriiting.png!
>
> 3. Eventually, the StandbyCheckpointer thread of the local SNN waits for the
> result of the ImageUpload thread, blocked in Future.get(), and the wait may
> last several hours:
> !get1.png!
> !get2.png!
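
The blocking chain in step 3 can be reproduced in miniature with plain java.util.concurrent. This is only an illustrative sketch, not Hadoop code: the class name BlockedCheckpointerDemo is hypothetical, and a CountDownLatch stands in for the socket write that cannot make progress while the peer is not reading.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo class; it only mimics the thread interaction described above. */
final class BlockedCheckpointerDemo {
    private BlockedCheckpointerDemo() {}

    /** Returns true if the waiting thread was still blocked while the simulated upload was stalled. */
    static boolean waiterStaysBlocked() {
        ExecutorService uploader = Executors.newSingleThreadExecutor();
        // Stands in for "the peer starts reading from the socket again".
        CountDownLatch peerReads = new CountDownLatch(1);
        try {
            // Simulated ImageUpload task: blocks the way a write on a full send queue would.
            Future<String> upload = uploader.submit(() -> {
                peerReads.await();
                return "done";
            });

            // Simulated checkpointer: waits on the Future and cannot proceed on its own.
            boolean timedOut = false;
            try {
                upload.get(100, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                timedOut = true; // the upload had not finished, so the waiter was blocked
            }

            // Release the upload so the demo terminates cleanly.
            peerReads.countDown();
            upload.get();
            return timedOut;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            uploader.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter blocked while upload stalled: " + waiterStaysBlocked());
    }
}
```

The bounded get() is used here only so the demo can observe the stall and exit. An unbounded get() in the same position would wait for as long as the upload remains blocked, which matches the behavior reported above.
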
>
> *Solution:*
> When the local SNN is ready to put an FsImage to the peer NN, it needs to
> test whether the put is actually warranted at this time. The test process is:
> # Establish an HTTP connection with the peer NN, send a put request, and
> then immediately read the response (this is the key point). If the peer NN
> replies with any of the following errors
> (TransferResult.AUTHENTICATION_FAILURE,
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult.
> # If the peer NN is truly the ANN and can receive the FsImage normally, it
> replies to the local SNN with an HTTP 410 response
> (HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE). At
> this point, the local SNN can actually begin to put the image.
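
The two steps above amount to a probe-then-upload decision, which might be sketched as follows. This is a minimal sketch under stated assumptions, not the patch itself: the class ImagePutProbe, its methods, and the peer names are hypothetical, and the probe function stands in for "send the put request and read the reply immediately". Only the meaning of HTTP 410 is taken from the description; every other status is simply treated as "do not upload".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical helper; not part of Hadoop. */
final class ImagePutProbe {
    /** Per the description, HTTP 410 (SC_GONE) on the probe means the peer will accept the image. */
    static final int READY_TO_RECEIVE = 410;

    private ImagePutProbe() {}

    /** True only when the probe reply signals that the peer will accept the image. */
    static boolean shouldUpload(int probeStatus) {
        return probeStatus == READY_TO_RECEIVE;
    }

    /**
     * Probes every peer first and keeps only those worth uploading to.
     * The probe argument is an assumption: it maps a peer to the HTTP status of its immediate reply.
     */
    static List<String> selectUploadTargets(List<String> peers, ToIntFunction<String> probe) {
        List<String> targets = new ArrayList<>();
        for (String peer : peers) {
            if (shouldUpload(probe.applyAsInt(peer))) {
                targets.add(peer);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Simulated replies; 417 is just a placeholder for "peer refused the image".
        Map<String, Integer> replies = Map.of("nn-active", 410, "nn-standby", 417);
        List<String> targets =
            selectUploadTargets(List.of("nn-active", "nn-standby"), replies::get);
        System.out.println(targets);
    }
}
```

Keeping the decision separate from the transfer means the expensive upload is started only for peers that passed the probe, so a refusing peer costs one small request instead of a full image transfer.
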
> *Note:*
> Reproducing this problem requires a large cluster (the FsImage in our
> cluster is about 30 GB), so a unit test is difficult to write.
> In our real cluster, the modification solved the problem: the large Send-Q
> backlog no longer occurs.