Xudong Cao created HDFS-14646:
---------------------------------
Summary: Standby NameNode should terminate the FsImage put process
as soon as possible if the peer NN is not in the appropriate state to receive
an image.
Key: HDFS-14646
URL: https://issues.apache.org/jira/browse/HDFS-14646
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 3.1.2
Reporter: Xudong Cao
Assignee: Xudong Cao
Attachments: blockWriiting.png, get1.png, get2.png, largeSendQ.png
*Problem Description:*
In a multi-NameNode scenario, when an SNN uploads an FsImage, it puts the
image to all other NNs (whether or not the peer NN is the ANN). Even if the
peer NN immediately replies with an error (such as
TransferResult.NOT_ACTIVE_NAMENODE_FAILURE or
TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate
the put process immediately. Instead, it puts the entire FsImage to the peer
NN and does not read the peer NN's reply until the put is completed.
In a relatively large HDFS cluster, the FsImage can often reach about 30 GB in
size. In this case, this useless put causes two problems:
1. Wasting time and bandwidth.
2. Since the ImageServlet of the peer NN no longer receives the FsImage, the
socket Send-Q of the local SNN grows very large, and the ImageUpload thread is
blocked writing to the socket for a long time, which often leaves the local
StandbyCheckpointer thread blocked for several hours.
*An example is as follows:*
In the following figures, the local NN 100.76.3.234 is an SNN, the peer NN
100.76.3.170 is another SNN, and 8080 is the NN HTTP port. When the local SNN
starts to put the FsImage, 170 immediately replies with a
NOT_ACTIVE_NAMENODE_FAILURE error. In this case, the local SNN should terminate
the put immediately, but in fact it has to wait until the image has been
completely put to the peer NN, and only then can it read the response.
1. At this time, since the ImageServlet of the peer NN no longer receives the
FsImage, the socket Send-Q of the local SNN is very large:
!largeSendQ.png!
2. Moreover, the local SNN's ImageUpload thread is blocked writing to the
socket for a long time:
!blockWriiting.png!
3. Eventually, the StandbyCheckpointer thread of the local SNN waits for
the execution result of the ImageUpload thread, blocking in Future.get(), and
the blocking time may be as long as several hours:
!get1.png!
!get2.png!
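The blocking chain in step 3 can be illustrated with a minimal Java sketch. The
class BlockingUploadDemo and the methods slowUpload and waitForUpload are
hypothetical stand-ins, not Hadoop code, and a sleep replaces the stalled socket
write. The sketch only shows that a caller in Future.get() without a timeout
stays blocked for as long as the upload task runs.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class BlockingUploadDemo {
    // Hypothetical stand-in for an upload whose socket write stalls.
    static String slowUpload(long stallMillis) throws InterruptedException {
        Thread.sleep(stallMillis);
        return "NOT_ACTIVE_NAMENODE_FAILURE";
    }

    // Returns how long the caller stayed blocked in Future.get(), in milliseconds.
    public static long waitForUpload(long stallMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        long start = System.nanoTime();
        try {
            Future<String> upload = executor.submit(() -> slowUpload(stallMillis));
            upload.get(); // no timeout: blocks until the upload task finishes
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdownNow();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        System.out.println("caller blocked for ~" + waitForUpload(500) + " ms");
    }
}
```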
*Solution:*
When the local SNN is ready to put an FsImage to the peer NN, it needs to test
whether it really needs to put it at this time. The test process is:
# Establish an HTTP connection with the peer NN, send a put request, and then
immediately read the response (this is the key point). If the peer NN replies
with any of the following errors (TransferResult.AUTHENTICATION_FAILURE,
TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult.
# If the peer NN is truly the ANN and can receive the FsImage normally, it
will reply to the local SNN with an HTTP response 410
(HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE). At
this time, the local SNN can really begin to put the image.
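The decision made after the early read can be sketched in Java as follows. This
is a minimal illustration, not the actual patch: the class PutProbe and its
methods are hypothetical, and the mapping of status codes 403, 417 and 409 to
the TransferResult names is an assumption. Only the 410 (SC_GONE) go-ahead is
taken from the description above. Treating every reply other than 410 as a
reason to skip the put is also an assumption.

```java
class PutProbe {
    // Standard HTTP status codes.
    static final int SC_FORBIDDEN = 403;
    static final int SC_CONFLICT = 409;
    static final int SC_GONE = 410;
    static final int SC_EXPECTATION_FAILED = 417;

    // Maps the peer's early reply to a result name.
    // The 403/417/409 mappings are assumptions made for this sketch.
    public static String classify(int responseCode) {
        switch (responseCode) {
            case SC_FORBIDDEN:
                return "AUTHENTICATION_FAILURE";
            case SC_EXPECTATION_FAILED:
                return "NOT_ACTIVE_NAMENODE_FAILURE";
            case SC_CONFLICT:
                return "OLD_TRANSACTION_ID_FAILURE";
            default:
                // Any other code, including 410, falls into the catch-all.
                return "UNEXPECTED_FAILURE";
        }
    }

    // Per the description, a 410 reply to the probe means the peer can
    // receive the image, so the real put may begin. Every other reply
    // ends the attempt early (assumption of this sketch).
    public static boolean shouldUpload(int probeResponseCode) {
        return probeResponseCode == SC_GONE;
    }

    public static void main(String[] args) {
        int[] replies = {SC_FORBIDDEN, SC_CONFLICT, SC_GONE, SC_EXPECTATION_FAILED};
        for (int code : replies) {
            System.out.println(code + " " + classify(code)
                + " upload=" + shouldUpload(code));
        }
    }
}
```

With this shape, the expensive put is started only after the cheap probe has
returned, so a peer in the wrong state costs one round trip instead of a full
image transfer.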
*Note:*
Reproducing this problem requires a large cluster (the FsImage in our cluster
is about 30 GB), so a unit test is difficult to write. In our real cluster, the
problem no longer occurs after the modification: the Send-Q no longer builds up
a large backlog.