[
https://issues.apache.org/jira/browse/HDFS-16836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634479#comment-17634479
]
Lei Yang commented on HDFS-16836:
---------------------------------
Thanks [~hexiaoqiao] for taking a look. doCheckpoint() can fail for many
reasons, and it does two things: 1. replays the edit log to generate an fsimage
on the standby; 2. uploads that fsimage to the active.
What we have seen is that #1 succeeded but #2 failed, so needRollbackCheckpoint
is never set back to false, and every subsequent checkpoint keeps triggering a
rollback fsimage for RU even after RU is finalized.
This bypasses the checkpoint period and threshold checks.
The current logic only resets needRollbackCheckpoint to false if both of the
following conditions are met:
# doCheckpoint() succeeds
# RU is still in progress, because RU finalize renames the fsimage from
IMAGE_ROLLBACK to IMAGE, which means
*namesystem.getFSImage().hasRollbackFSImage()* would return false afterwards.
{code:java}
if (needRollbackCheckpoint && namesystem.getFSImage().hasRollbackFSImage())
{code}
After RU is finalized, rollback cannot happen. It would not make sense to
generate a rollback image after RU is finalized, right?
{code:java}
private void doWork()
...
    boolean needCheckpoint = needRollbackCheckpoint;
    if (needCheckpoint) {
      LOG.info("Triggering a rollback fsimage for rolling upgrade.");
    } else if (uncheckpointed >= checkpointConf.getTxnCount()) {
      LOG.info("Triggering checkpoint because there have been " +
          uncheckpointed + " txns since the last checkpoint, which " +
          "exceeds the configured threshold " +
          checkpointConf.getTxnCount());
      needCheckpoint = true;
    } else if (secsSinceLast >= checkpointConf.getPeriod()) {
      LOG.info("Triggering checkpoint because it has been " +
          secsSinceLast + " seconds since the last checkpoint, which " +
          "exceeds the configured interval " + checkpointConf.getPeriod());
      needCheckpoint = true;
    }
    if (needCheckpoint) {
      // on all nodes, we build the checkpoint. However, we only ship the
      // checkpoint if have a rollback request, are the checkpointer, are
      // outside the quiet period.
      doCheckpoint();
      // reset needRollbackCheckpoint to false only when we finish a ckpt
      // for rollback image
      if (needRollbackCheckpoint
          && namesystem.getFSImage().hasRollbackFSImage()) {
        namesystem.setCreatedRollbackImages(true);
        namesystem.setNeedRollbackFsImage(false);
      }
      lastCheckpointTime = now;
    }
  } catch (Throwable t) {
    LOG.error("Exception in doCheckpoint", t);
  }
{code}
The point is that finalizing RU terminates the rolling upgrade, so a rollback
can no longer happen.
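To make the failure sequence concrete, here is a small self-contained toy model of the flag handling described above. It is only a sketch and not Hadoop code: the class RollbackFlagSketch, its fields, and the clearFlagWhenRuIsOver guard are all hypothetical names introduced for illustration, and the guard is just one conceivable way to stop the loop, not necessarily what the linked pull request does.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the checkpoint decision loop; all names are hypothetical. */
class RollbackFlagSketch {
    boolean needRollbackCheckpoint;    // set to true when RU starts
    boolean rollingUpgradeInProgress;
    boolean hasRollbackFsImage;        // stands in for hasRollbackFSImage()
    boolean uploadFails;               // simulates step 2 of doCheckpoint() failing
    final boolean clearFlagWhenRuIsOver; // hypothetical guard, off = current logic

    RollbackFlagSketch(boolean clearFlagWhenRuIsOver) {
        this.clearFlagWhenRuIsOver = clearFlagWhenRuIsOver;
    }

    void startRollingUpgrade() {
        rollingUpgradeInProgress = true;
        needRollbackCheckpoint = true;
    }

    void finalizeRollingUpgrade() {
        rollingUpgradeInProgress = false;
        hasRollbackFsImage = false; // IMAGE_ROLLBACK is renamed to IMAGE
    }

    private void doCheckpoint() {
        // Step 1: build the image locally; it is a rollback image only during RU.
        if (rollingUpgradeInProgress) {
            hasRollbackFsImage = true;
        }
        // Step 2: upload to the active, which may fail.
        if (uploadFails) {
            throw new IllegalStateException("upload to active failed");
        }
    }

    /** One pass of the loop; returns why a checkpoint was triggered, or "none". */
    String doWork(long uncheckpointedTxns, long txnThreshold) {
        if (clearFlagWhenRuIsOver && needRollbackCheckpoint
                && !rollingUpgradeInProgress) {
            needRollbackCheckpoint = false; // assumption: no rollback image after finalize
        }
        boolean needCheckpoint = needRollbackCheckpoint;
        String reason = "none";
        if (needCheckpoint) {
            reason = "rollback";
        } else if (uncheckpointedTxns >= txnThreshold) {
            needCheckpoint = true;
            reason = "txns";
        }
        if (needCheckpoint) {
            try {
                doCheckpoint();
                if (needRollbackCheckpoint && hasRollbackFsImage) {
                    needRollbackCheckpoint = false;
                }
            } catch (RuntimeException e) {
                // Like the catch (Throwable) above: the flag is left untouched.
            }
        }
        return reason;
    }

    /** Start RU, fail one checkpoint, finalize RU, then run two more passes. */
    static List<String> scenario(boolean guard) {
        RollbackFlagSketch s = new RollbackFlagSketch(guard);
        List<String> reasons = new ArrayList<>();
        s.startRollingUpgrade();
        s.uploadFails = true;
        reasons.add(s.doWork(0, 1_000_000));
        s.finalizeRollingUpgrade();
        s.uploadFails = false;
        reasons.add(s.doWork(0, 1_000_000));
        reasons.add(s.doWork(0, 1_000_000));
        return reasons;
    }

    public static void main(String[] args) {
        System.out.println(scenario(false)); // [rollback, rollback, rollback]
        System.out.println(scenario(true));  // [rollback, none, none]
    }
}
```

With the guard off, every pass after finalize still reports "rollback" even though no transactions are pending, which matches the behaviour reported here. With the guard on, the loop falls back to the normal threshold check.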
> StandbyCheckpointer can still trigger rollback fs image after RU is finalized
> -----------------------------------------------------------------------------
>
> Key: HDFS-16836
> URL: https://issues.apache.org/jira/browse/HDFS-16836
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Lei Yang
> Priority: Major
> Labels: pull-request-available
>
> StandbyCheckpointer triggers a rollback fsimage when RU is started.
> When RU is started, a flag (needRollbackImage) is set to true during edit
> log replay.
> It only gets reset to false when doCheckpoint() succeeds.
> Consider the following scenario:
> # Start RU; needRollbackImage is set to true.
> # doCheckpoint() fails.
> # RU is finalized.
> # namesystem.getFSImage().hasRollbackFSImage() is always false, since a
> rollback image cannot be generated once RU is over.
> # needRollbackImage is never set to false.
> # The checkpoint threshold (1m txns) and period (1hr) are not honored.
> {code:java}
> StandbyCheckpointer:
> void doWork() {
>   ....
>   doCheckpoint();
>   // reset needRollbackCheckpoint to false only when we finish a ckpt
>   // for rollback image
>   if (needRollbackCheckpoint
>       && namesystem.getFSImage().hasRollbackFSImage()) {
>     namesystem.setCreatedRollbackImages(true);
>     namesystem.setNeedRollbackFsImage(false);
>   }
>   lastCheckpointTime = now;
> }
> {code}