[
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885404#comment-16885404
]
Íñigo Goiri commented on HDFS-14646:
------------------------------------
We run subclusters with 4 Namenodes and I'm not aware of us seeing the issue.
I'll go over the logs to see if it's indeed happening.
Loading the FSImage is currently our main time sink, so we might simply have
overlooked this.
HDFS-6440 seems to leave a TODO open which is somewhat solved here.
It looks like an optimization worth adding in any case.
Not sure how to test this... adding synthetic delays might make the running
time too long; lowering the timeouts might have some other side effects too.
> Standby NameNode should terminate the FsImage put process immediately if the
> peer NN is not in the appropriate state to receive an image.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.1.2
> Reporter: Xudong Cao
> Assignee: Xudong Cao
> Priority: Major
> Attachments: blockedInWritingSocket.png, get1.png, get2.png,
> largeSendQ.png
>
>
> *Problem Description:*
> In the multi-NameNode scenario, when an SNN uploads an FsImage, it puts the
> image to all other NNs, whether or not the peer NN is the ANN. Even if the
> peer NN immediately replies with an error (such as
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE or
> TransferResult.OLD_TRANSACTION_ID_FAILURE), the local SNN does not terminate
> the put process. Instead it sends the entire FsImage to the peer NN and only
> reads the peer NN's reply after the put has completed.
> In a relatively large HDFS cluster, the size of FsImage can often reach about
> 30GB. In this case, this invalid put brings two problems:
> # Wasting time and bandwidth.
> # Since the ImageServlet of the peer NN no longer receives the FsImage, the
> socket Send-Q of the local SNN grows very large, and the ImageUpload thread
> is blocked writing to the socket for a long time, which in turn often leaves
> the local StandbyCheckpointer thread blocked for several hours.
> *An example is as follows:*
> In the following figures, the local NN 100.76.3.234 is an SNN, the peer NN
> 100.76.3.170 is another SNN, and 8080 is the NN HTTP port. When the local SNN
> starts to put the FsImage, 170 immediately replies with a
> NOT_ACTIVE_NAMENODE_FAILURE error. The local SNN should then terminate the
> put immediately, but in fact it has to wait until the image has been
> completely put to the peer NN before it can read the response.
> # At this time, since the ImageServlet of the peer NN no longer receives the
> FsImage, the socket Send-Q of the local SNN is very large:
> !largeSendQ.png!
> # Moreover, the local SNN's ImageUpload thread is blocked writing to the
> socket for a long time:
> !blockedInWritingSocket.png!
>
> # Eventually, the StandbyCheckpointer thread of the local SNN, which waits in
> Future.get() for the result of the ImageUpload thread, may stay blocked for
> several hours:
> !get1.png!
>
> !get2.png!
>
>
> *Solution:*
> When the local SNN plans to put an FsImage to the peer NN, it needs to test
> whether it really needs to put it at this time. The test process is:
> # Establish an HTTP connection with the peer NN, send the put request, and
> then immediately read the response (this is the key point). If the peer NN
> replies with any of the following errors
> (TransferResult.AUTHENTICATION_FAILURE,
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE,
> TransferResult.OLD_TRANSACTION_ID_FAILURE), immediately terminate the put
> process.
> # If the peer NN is indeed the Active NameNode AND it is now in the
> appropriate state to receive an image, it will reply with HTTP response 410
> (HttpServletResponse.SC_GONE, which is TransferResult.UNEXPECTED_FAILURE). At
> this time, the local SNN can really begin to put the image.
> *Note:*
> This problem needs to be reproduced in a large cluster (the size of FsImage
> in our cluster is about 30GB). Therefore, a unit test is difficult to write.
> In our cluster, after the modification, the problem has been solved and the
> large Send-Q backlog no longer occurs.
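
The test process in the quoted Solution can be sketched as a small decision
function over the HTTP status returned by the pre-check request. This is only
an illustration, not the attached patch: PutPrecheck, Decision and decide are
hypothetical names, and the 403/409/417 codes assigned to the three
TransferResult failures are assumptions that should be checked against
TransferFsImage. Only 410 (SC_GONE) is taken from the description above.

```java
/** Hypothetical helper (not a Hadoop class) sketching the pre-check decision. */
final class PutPrecheck {
    // Assumed mapping of TransferResult failures to HTTP status codes.
    static final int SC_FORBIDDEN = 403;          // assumed: AUTHENTICATION_FAILURE
    static final int SC_CONFLICT = 409;           // assumed: OLD_TRANSACTION_ID_FAILURE
    static final int SC_EXPECTATION_FAILED = 417; // assumed: NOT_ACTIVE_NAMENODE_FAILURE
    // Stated in the issue description: the peer is ready to receive the image.
    static final int SC_GONE = 410;

    enum Decision { ABORT_PUT, START_PUT }

    /** Maps the status read right after sending the put request to a decision. */
    static Decision decide(int precheckStatus) {
        switch (precheckStatus) {
            case SC_FORBIDDEN:
            case SC_CONFLICT:
            case SC_EXPECTATION_FAILED:
                // The peer rejected the request up front: do not stream the image.
                return Decision.ABORT_PUT;
            case SC_GONE:
                // Per the description, this means the peer can accept the image.
                return Decision.START_PUT;
            default:
                // Not covered by the description; aborting here is this sketch's
                // own conservative assumption.
                return Decision.ABORT_PUT;
        }
    }

    private PutPrecheck() { }
}
```

With such a check in place, an upload to a peer SNN would stop after one
round trip instead of streaming the whole image into a socket nobody reads.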