rstest created HDFS-17932:
-----------------------------

             Summary: SecondaryNameNode checkpoint stuck in retry infinitely 
after rolling upgrade
                 Key: HDFS-17932
                 URL: https://issues.apache.org/jira/browse/HDFS-17932
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode, rolling upgrades
    Affects Versions: 3.4.2, 3.3.6, 2.10.2
            Reporter: rstest


# Summary

A SecondaryNameNode checkpoint retry can fail indefinitely during a non-HA 
rolling upgrade, because the retry replays `OP_ROLLING_UPGRADE_START` a second 
time against stale in-memory state.

# Bug Symptom

During a non-HA HDFS rolling upgrade from Hadoop 2.10.2 to Hadoop 3.3.6, the 
SecondaryNameNode can become stuck failing checkpoint retries after a transient 
NameNode RPC failure.

The failure occurs when the SecondaryNameNode has already replayed 
`OP_ROLLING_UPGRADE_START` while merging a checkpoint, then the checkpoint 
merge fails later when the SecondaryNameNode calls 
`namenode.isRollingUpgrade()`. On retry, the same SecondaryNameNode process 
reloads the checkpoint inputs and replays `OP_ROLLING_UPGRADE_START` again, but 
its local in-memory `FSNamesystem` still has `rollingUpgradeInfo` active from 
the failed first attempt.

The retry then fails with a `RollingUpgradeException`, because 
`FSNamesystem.checkRollingUpgrade("start rolling upgrade")` rejects starting a 
rolling upgrade while one is already in progress.
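
The guard described above can be illustrated with a small toy model. This is a 
sketch only: `ToyNamesystem`, `ToyRollingUpgradeException`, and 
`ReplayTwiceDemo` are hypothetical stand-ins written for this report, not the 
Hadoop classes, and the start time `1000L` is an arbitrary value.

```java
// Toy model only: simplified stand-ins, not the Hadoop sources.
class ToyRollingUpgradeException extends RuntimeException {
    ToyRollingUpgradeException(String msg) { super(msg); }
}

class ToyNamesystem {
    // null means no rolling upgrade is active in this in-memory namesystem.
    private Long rollingUpgradeStartTime = null;

    boolean isRollingUpgrade() { return rollingUpgradeStartTime != null; }

    // Models the guard: refuse the action while a rolling upgrade is active.
    void checkRollingUpgrade(String action) {
        if (isRollingUpgrade()) {
            throw new ToyRollingUpgradeException("Failed to " + action
                + " since a rolling upgrade is already in progress.");
        }
    }

    // Models what edit replay does when it encounters the START op.
    void startRollingUpgradeInternal(long startTime) {
        checkRollingUpgrade("start rolling upgrade");
        rollingUpgradeStartTime = startTime;
    }
}

class ReplayTwiceDemo {
    // Replays the START op `times` times against ONE namesystem instance.
    // Returns the 1-based replay that was rejected, or -1 if none was.
    static int firstRejectedReplay(int times) {
        ToyNamesystem ns = new ToyNamesystem();
        for (int i = 1; i <= times; i++) {
            try {
                ns.startRollingUpgradeInternal(1000L);
            } catch (ToyRollingUpgradeException e) {
                return i;
            }
        }
        return -1;
    }
}
```

In this model a single replay is accepted, while a second replay against the 
same instance is rejected, which is the situation the retry ends up in.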

Expected behavior:

- A transient NameNode RPC failure during SecondaryNameNode checkpoint merge 
should be recoverable.
- A checkpoint retry should not fail because stale in-memory rolling-upgrade 
state from the failed merge attempt remains active.

Actual behavior:

- The running SecondaryNameNode checkpoint loop can remain stuck.
- Checkpoints stop being produced/uploaded by the SecondaryNameNode.
- Edit logs can continue accumulating until the SecondaryNameNode is restarted 
or local checkpoint state is manually cleaned up.

Relevant code path:

- `SecondaryNameNode.doCheckpoint()` calls `doMerge(...)`.
- `SecondaryNameNode.doMerge()` calls 
`Checkpointer.rollForwardByApplyingLogs(...)`.
- Edit replay sees `OP_ROLLING_UPGRADE_START`.
- `FSEditLogLoader` calls `fsNamesys.startRollingUpgradeInternal(startTime)`.
- `startRollingUpgradeInternal` sets `rollingUpgradeInfo` in the 
SecondaryNameNode-local `FSNamesystem`.
- `doMerge()` saves the merged local fsimage, then calls 
`namenode.isRollingUpgrade()`.
- If that RPC fails, `doCheckpoint()` calls `checkpointImage.setMergeError()`.
- The retry reloads the image and replays the edits, but the local 
`rollingUpgradeInfo` state is still active.
- Replaying `OP_ROLLING_UPGRADE_START` again throws because rolling upgrade is 
already in progress.
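
The sequence above can be simulated end to end with a toy checkpoint loop. 
This is a minimal sketch under stated assumptions: `ToySecondary`, its 
`NameNodeRpc` interface, and the outcome labels are hypothetical names, and the 
model assumes (as described in this report) that the retry does not clear the 
rolling-upgrade flag before replaying edits.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy simulation of the retry sequence; none of this is Hadoop code.
class ToySecondary {
    // In-memory state that survives across checkpoint attempts.
    private boolean rollingUpgradeActive = false;
    private boolean mergeError = false;
    private boolean checkpointed = false;

    // Hypothetical stand-in for the NameNode RPC used at the end of the merge.
    interface NameNodeRpc { boolean isRollingUpgrade() throws IOException; }

    // Stand-in for replaying edits that contain OP_ROLLING_UPGRADE_START.
    private void replayEdits() {
        if (rollingUpgradeActive) {
            throw new IllegalStateException(
                "rolling upgrade is already in progress");
        }
        rollingUpgradeActive = true;
    }

    // One checkpoint attempt; returns a short label for the outcome.
    String doCheckpoint(NameNodeRpc namenode) {
        try {
            // First attempt, or retry after a merge error: replay everything.
            // Note that rollingUpgradeActive is NOT reset here (modelled bug).
            if (!checkpointed || mergeError) {
                replayEdits();
            }
            namenode.isRollingUpgrade(); // RPC issued after the local merge
            mergeError = false;
            checkpointed = true;
            return "ok";
        } catch (IOException e) {
            mergeError = true;
            return "rpc-failure";
        } catch (IllegalStateException e) {
            mergeError = true;
            return "already-in-progress";
        }
    }

    // Runs `attempts` checkpoints where only the first RPC call fails.
    static List<String> simulate(int attempts) {
        ToySecondary snn = new ToySecondary();
        int[] calls = {0};
        NameNodeRpc flaky = () -> {
            if (calls[0]++ == 0) {
                throw new IOException("EOF: NameNode restarting");
            }
            return true;
        };
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < attempts; i++) {
            outcomes.add(snn.doCheckpoint(flaky));
        }
        return outcomes;
    }
}
```

Calling `ToySecondary.simulate(3)` yields one `rpc-failure` followed by 
`already-in-progress` on every later attempt, even though the RPC itself would 
succeed from the second attempt onward.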

Version pairs tested:

- Hadoop 2.10.2 -> Hadoop 3.3.6: issue observed.
- Hadoop 3.3.6 -> Hadoop 3.4.2: also covered by upgrade testing, but this 
specific failure was observed on 2.10.2 -> 3.3.6.

# How To Reproduce

One way to reproduce is to force a transient NameNode RPC failure during the 
narrow checkpoint window after the SecondaryNameNode has replayed 
`OP_ROLLING_UPGRADE_START` but before `SecondaryNameNode.doMerge()` completes.

1. Start a non-HA HDFS cluster on Hadoop 2.10.2 with:
- one NameNode
- one SecondaryNameNode
- one DataNode

2. Prepare a rolling upgrade on the NameNode.

This should create a rollback image and write a rolling-upgrade START operation 
into the edit log.

3. Upgrade/start the SecondaryNameNode on Hadoop 3.3.6 while the rolling 
upgrade is prepared and not finalized.

4. Trigger or wait for a SecondaryNameNode checkpoint.

The SecondaryNameNode should download the rollback fsimage and edit logs, then 
replay edits containing `OP_ROLLING_UPGRADE_START`.

5. After the SecondaryNameNode has replayed `OP_ROLLING_UPGRADE_START` and 
saved the merged local checkpoint image, but before `doMerge()` completes, make 
the NameNode RPC endpoint temporarily unavailable.

A practical way to induce this is to restart the NameNode at this point. The 
observed failure point is the SecondaryNameNode RPC call to 
`namenode.isRollingUpgrade()`, which can fail with an EOF/connection-closed 
error if the NameNode is down.

6. Bring the NameNode back and let the same SecondaryNameNode process retry 
checkpointing.

7. Observe that the retry reloads/replays the checkpoint inputs and encounters 
`OP_ROLLING_UPGRADE_START` again.

8. The retry fails because the SecondaryNameNode-local `rollingUpgradeInfo` 
from the failed previous merge attempt is still active.

Expected result:

- The SecondaryNameNode retry should recover from the transient RPC failure and 
complete checkpointing.

Actual result:

- The retry fails with a `RollingUpgradeException` stating that a rolling 
upgrade is already in progress.
- The same SecondaryNameNode process can continue failing future checkpoint 
attempts until it is restarted or its local state is cleaned.

Representative exception:

```text
org.apache.hadoop.hdfs.protocol.RollingUpgradeException:
Failed to start rolling upgrade since a rolling upgrade is already in progress.
```

Potential fix direction:

- Ensure that checkpoint retry after `setMergeError()` fully resets or reloads 
all SecondaryNameNode-local `FSNamesystem` state affected by edit replay, 
including `rollingUpgradeInfo`.
- Alternatively, make replay of `OP_ROLLING_UPGRADE_START` idempotent in this 
checkpoint-retry context when the existing rolling-upgrade info matches the 
START operation being replayed.
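
Both directions can be sketched on a toy state holder. This is illustrative 
only: `ToyFixNamesystem` and its methods (`resetForReload`, `startStrict`, 
`startIdempotent`) are hypothetical names, and matching on the start time alone 
is an assumption about what "matches the START operation" would mean in a real 
patch.

```java
// Toy sketch of the two fix directions; names are illustrative, not Hadoop APIs.
class ToyFixNamesystem {
    private Long rollingUpgradeStartTime = null; // null = no upgrade active

    boolean isRollingUpgrade() { return rollingUpgradeStartTime != null; }

    // Direction 1: clear replay-affected state before a retry reloads inputs.
    void resetForReload() { rollingUpgradeStartTime = null; }

    // Strict behaviour, as in the failing path: any active upgrade is an error.
    void startStrict(long startTime) {
        if (rollingUpgradeStartTime != null) {
            throw new IllegalStateException("already in progress");
        }
        rollingUpgradeStartTime = startTime;
    }

    // Direction 2: accept a re-replay of the same START op, reject a different one.
    void startIdempotent(long startTime) {
        if (rollingUpgradeStartTime != null) {
            if (rollingUpgradeStartTime.longValue() == startTime) {
                return; // same START op replayed again: nothing to do
            }
            throw new IllegalStateException("already in progress");
        }
        rollingUpgradeStartTime = startTime;
    }

    // First attempt replays START, the merge fails, the retry resets then replays.
    static boolean retrySucceedsWithReset() {
        ToyFixNamesystem ns = new ToyFixNamesystem();
        ns.startStrict(1000L);
        ns.resetForReload();
        ns.startStrict(1000L);
        return ns.isRollingUpgrade();
    }

    // Same scenario without a reset, relying on the idempotent replay instead.
    static boolean retrySucceedsWithIdempotentReplay() {
        ToyFixNamesystem ns = new ToyFixNamesystem();
        ns.startIdempotent(1000L);
        ns.startIdempotent(1000L);
        return ns.isRollingUpgrade();
    }

    // A START op with a different start time is still rejected.
    static boolean mismatchedStartRejected() {
        ToyFixNamesystem ns = new ToyFixNamesystem();
        ns.startIdempotent(1000L);
        try {
            ns.startIdempotent(2000L);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

The reset approach keeps the strict guard unchanged but depends on every 
replay-affected field being cleared, while the idempotent approach is local to 
the START handler and still rejects a genuinely conflicting rolling upgrade.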



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
