[
https://issues.apache.org/jira/browse/HDFS-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991998#comment-16991998
]
Chen Liang edited comment on HDFS-15036 at 12/9/19 10:36 PM:
-
I spent some time debugging this issue, and I think I found the cause.
In HDFS-12979, we introduced logic that rejects an image upload request if the
image being uploaded is not far enough ahead of the previous image. This
prevents the scenario where there are multiple Standby NameNodes (SbNs) and all
of them upload images to the Active NameNode (ANN) too frequently. The rejection
is considered correct behavior, so nothing is logged to indicate an error (this
is the "silent" part). Both ANN and SbN simply ignore it and proceed.
But it now appears that a side effect of this change is that, during a rolling
upgrade (RU), the rollback image also has to go through this check and can also
be rejected. If this happens, the SbN proceeds assuming the upload is done,
while the ANN proceeds without ever having received the rollback image. The
upload silently fails in this case.
The check that rejects the upload is in {{ImageServlet}}. In my earlier test, I
just commented out the whole block below and the issue seemed to go away. But I
think the fix is probably just to add a new condition so that this rejection
only applies to regular image uploads, not rollback images, like the newly
added line marked in the following code snippet. I haven't actually tested
changing it this way:
{code:java}
if (checkRecentImageEnable &&
    NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) &&
    // <--- this should fix the issue, as NameNodeFile.IMAGE_ROLLBACK should
    // bypass this
    timeDelta < checkpointPeriod &&
    txid - lastCheckpointTxid < checkpointTxnCount) {
  // only when at least one of two conditions are met we accept
  // a new fsImage
  // 1. most recent image's txid is too far behind
  // 2. last checkpoint time was too old
  response.sendError(HttpServletResponse.SC_CONFLICT,
      "Most recent checkpoint is neither too far behind in "
      + "txid, nor too old. New txnid cnt is "
      + (txid - lastCheckpointTxid)
      + ", expecting at least " + checkpointTxnCount
      + " unless too long since last upload.");
  return null;
}
{code}
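To make the effect of the extra condition concrete, here is a minimal standalone sketch of the rejection predicate. It is not the real {{ImageServlet}} code: the class {{UploadCheckSketch}}, the {{FileKind}} enum (a stand-in for {{NameNodeFile}}), the {{txnDelta}} parameter (standing in for {{txid - lastCheckpointTxid}}) and all numeric values are assumptions made up for illustration. Only the shape of the condition is taken from the snippet above.

```java
// Hypothetical standalone model of the rejection check; NOT the real ImageServlet.
final class UploadCheckSketch {

    // Stand-in for the two NameNodeFile values discussed in this comment.
    enum FileKind { IMAGE, IMAGE_ROLLBACK }

    /**
     * Returns true when the upload would be rejected.
     * txnDelta models (txid - lastCheckpointTxid) from the snippet above.
     */
    static boolean shouldReject(boolean checkRecentImageEnable, FileKind kind,
                                long timeDelta, long checkpointPeriod,
                                long txnDelta, long checkpointTxnCount) {
        return checkRecentImageEnable
            && kind == FileKind.IMAGE          // proposed extra condition
            && timeDelta < checkpointPeriod
            && txnDelta < checkpointTxnCount;
    }

    public static void main(String[] args) {
        // Illustrative thresholds only; not actual defaults.
        long period = 3600;
        long txnCount = 1_000_000;

        // Regular image, recent and close in txid: rejected.
        System.out.println(shouldReject(true, FileKind.IMAGE, 60, period, 500, txnCount));          // true
        // Rollback image with the same numbers: bypasses the check.
        System.out.println(shouldReject(true, FileKind.IMAGE_ROLLBACK, 60, period, 500, txnCount)); // false
        // Regular image, but the last checkpoint is old enough: accepted.
        System.out.println(shouldReject(true, FileKind.IMAGE, 7200, period, 500, txnCount));        // false
        // Regular image, far enough ahead in txid: accepted.
        System.out.println(shouldReject(true, FileKind.IMAGE, 60, period, 2_000_000, txnCount));    // false
    }
}
```

With the extra condition in place, a rollback image upload is never rejected by this check, while regular image uploads keep the behavior introduced in HDFS-12979.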
was (Author: vagarychen):
I spent some time debugging this issue, and I think I found the cause.
In HDFS-12979, we introduced logic that rejects an image upload request if the
image being uploaded is not far enough ahead of the previous image. This
prevents the scenario where there are multiple SbNs and all of them upload
images to the ANN too frequently. The rejection is considered correct behavior,
so nothing is logged to indicate an error (this is the "silent" part). Both ANN
and SbN simply ignore it and proceed.
But it now appears that a side effect of this change is that, during RU, the
rollback image also has to go through this check and can also be rejected. If
this happens, the SbN proceeds assuming the upload is done, while the ANN
proceeds without ever having received the rollback image. The upload silently
fails in this case.
The check that rejects the upload is in {{ImageServlet}}. In my earlier test, I
just commented out the whole block below and the issue seemed to go away. But I
think the fix is probably just to add a new condition so that this rejection
only applies to regular image uploads, like the newly added line marked in the
following code snippet. I haven't actually tested changing it this way:
{code}
if (checkRecentImageEnable &&
    NameNodeFile.IMAGE.equals(parsedParams.getNameNodeFile()) &&
    // <--- this should fix the issue
    timeDelta < checkpointPeriod &&
    txid - lastCheckpointTxid < checkpointTxnCount) {
  // only when at least one of two conditions are met we accept
  // a new fsImage
  // 1. most recent image's txid is too far behind
  // 2. last checkpoint time was too old
  response.sendError(HttpServletResponse.SC_CONFLICT,
      "Most recent checkpoint is neither too far behind in "
      + "txid, nor too old. New txnid cnt is "
      + (txid - lastCheckpointTxid)
      + ", expecting at least " + checkpointTxnCount
      + " unless too long since last upload.");
  return null;
}
{code}
> Active NameNode should not silently fail the image transfer
> ---
>
> Key: HDFS-15036
> URL: