[
https://issues.apache.org/jira/browse/HDFS-17863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053747#comment-18053747
]
ASF GitHub Bot commented on HDFS-17863:
---------------------------------------
teamconfx opened a new pull request, #8203:
URL: https://github.com/apache/hadoop/pull/8203
### Description of PR
This PR fixes [HDFS-17863](https://issues.apache.org/jira/browse/HDFS-17863).
The bug occurs when a DataNode restarts: under-construction files become
unreadable, even though their data was successfully flushed with hflush().
This breaks HDFS's visibility guarantee for flushed data.
When a DataNode restarts, under-construction block replicas in the "rbw"
(replica being written) directory are loaded as ReplicaWaitingToBeRecovered
(RWR state). The getVisibleLength() method in this class unconditionally
returned -1:
```java
// Before (ReplicaWaitingToBeRecovered.java:75-77)
@Override
public long getVisibleLength() {
return -1; //no bytes are visible
}
```
When a client tries to read the file:
1. DFSInputStream calls readBlockLength() to determine the
under-construction block length
2. It contacts the DataNode via getReplicaVisibleLength()
3. The DataNode returns -1 (from RWR replica)
4. Client treats this as invalid and throws
CannotObtainBlockLengthException
This violates HDFS's hflush() contract, which guarantees that flushed data
remains visible to readers.
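The failure path above can be modeled with a small self-contained sketch. The class and method names here are illustrative stand-ins, not the actual DFSInputStream code: each replica reports a visible length, and the client rejects anything negative.

```java
// Simplified model of the client-side length probe (illustrative only;
// the real logic lives in DFSInputStream#readBlockLength).
import java.util.List;

public class BlockLengthProbe {

    /** Models what each DataNode reports via getReplicaVisibleLength(). */
    interface ReplicaSource {
        long getReplicaVisibleLength();
    }

    /**
     * Ask each replica in turn; any non-negative answer wins.
     * If every replica reports -1, the client gives up, which is the
     * CannotObtainBlockLengthException path described above.
     */
    static long readBlockLength(List<ReplicaSource> datanodes) {
        for (ReplicaSource dn : datanodes) {
            long len = dn.getReplicaVisibleLength();
            if (len >= 0) {
                return len; // valid visible length
            }
            // -1 means "no bytes visible": try the next DataNode
        }
        throw new IllegalStateException("Cannot obtain block length");
    }

    public static void main(String[] args) {
        // One healthy replica: the probe succeeds.
        long ok = readBlockLength(List.of((ReplicaSource) () -> 6L));
        System.out.println("visible length = " + ok);

        // All replicas restarted into RWR state and returning -1: the probe fails.
        try {
            readBlockLength(List.of(
                (ReplicaSource) () -> -1L,
                (ReplicaSource) () -> -1L));
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Note how a single -1 is survivable as long as some replica answers, but after a full DataNode restart every replica is RWR, so every probe returns -1 and the read fails.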
### Changes
Changed ReplicaWaitingToBeRecovered.getVisibleLength() to return
getNumBytes() instead of -1:
```java
// After (ReplicaWaitingToBeRecovered.java:75-77)
@Override
public long getVisibleLength() {
return getNumBytes(); // all bytes are visible since validated on load
}
```
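The effect of the change can be shown with a minimal stand-in class (this is an illustrative model, not the real ReplicaWaitingToBeRecovered): the replica already knows its checksum-validated on-disk length, and the fix simply exposes it to readers.

```java
// Stand-in for an RWR replica; names mirror the real class but this is
// a sketch, not Hadoop code.
public class RwrReplicaSketch {
    private final long numBytes; // length validated against checksums on load

    RwrReplicaSketch(long numBytes) {
        this.numBytes = numBytes;
    }

    long getNumBytes() {
        return numBytes;
    }

    /** Before the fix: -1 unconditionally, hiding flushed data. */
    long getVisibleLengthBefore() {
        return -1; // no bytes are visible
    }

    /** After the fix: the validated on-disk length is visible. */
    long getVisibleLengthAfter() {
        return getNumBytes(); // all bytes visible; length validated on load
    }

    public static void main(String[] args) {
        RwrReplicaSketch replica = new RwrReplicaSketch(6); // 6 flushed bytes
        System.out.println("before fix: " + replica.getVisibleLengthBefore());
        System.out.println("after fix:  " + replica.getVisibleLengthAfter());
    }
}
```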
### Why This Fix Is Safe
The fix is safe because the block length returned by getNumBytes() has
already been validated against checksums when the replica is loaded from disk.
In BlockPoolSlice.addReplicaToReplicasMap() (lines 693-700), RWR replicas
are created with a validated length:
```java
if (loadRwr) {
ReplicaBuilder builder = new ReplicaBuilder(ReplicaState.RWR)
.setBlockId(blockId)
.setLength(validateIntegrityAndSetLength(file, genStamp)) // length validated here
// ... (remaining builder calls truncated in this excerpt)
```
> CannotObtainBlockLengthException after DataNode restart
> -------------------------------------------------------
>
> Key: HDFS-17863
> URL: https://issues.apache.org/jira/browse/HDFS-17863
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, dfs, hdfs-client
> Affects Versions: 3.3.5
> Environment: Hadoop 3.3.5
> Java 8
> Maven 3.6.3
> Reporter: rstest
> Priority: Critical
> Attachments: reproduce.sh, restart.patch
>
>
> h2. DESCRIPTION:
> After hflush(), HDFS guarantees that written data becomes visible to readers,
> even while the file remains under construction. This guarantee is BROKEN after
> DataNode restart. Under-construction blocks that have been flushed become
> inaccessible (visible length = -1) until explicit lease recovery, causing
> CannotObtainBlockLengthException when clients try to read the file.
> This is a genuine production bug that affects:
> - HBase WAL recovery after DataNode failures
> - Streaming applications that write and read simultaneously
> - Any application relying on hflush() visibility guarantees
> h2. STEPS TO REPRODUCE:
> Download the `reproduce.sh` and `restart.patch`, then
> {code:java}
> $ bash reproduce.sh
> {code}
> The script:
> 1. Clones Hadoop repository (release 3.3.5 branch)
> 2. Applies test patch (restart.patch) that adds the reproduction test
> 3. Builds the Hadoop HDFS module
> 4. Runs test case:
> TestBlockToken#testLastLocatedBlockTokenExpiryWithDataNodeRestart
> *EXPECTED RESULT:*
> File should be readable after hflush(), even after DataNode restart
> *ACTUAL RESULT:*
> org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block
> length for LocatedBlock
> The bug is confirmed if the test fails with CannotObtainBlockLengthException.
> *KEY OBSERVATION:*
> - Tests with NameNode-only restart (no DataNode restart) DO NOT fail
> - The bug ONLY occurs when DataNode restarts
> h2. ROOT CAUSE:
> When a DataNode restarts, under-construction block replicas are loaded from
> disk and placed in ReplicaWaitingToBeRecovered (RWR) state:
> File: ReplicaWaitingToBeRecovered.java:75
> {code:java}
> @Override
> public long getVisibleLength() {
>   return -1; // no bytes are visible
> }
> {code}
> This state explicitly returns -1 for visible length, meaning "no bytes
> visible" until lease recovery completes.
> When a client tries to open the file:
> 1. DFSInputStream calls readBlockLength() to determine UC block length
> 2. Contacts DataNode via getReplicaVisibleLength()
> 3. Receives -1 (not a valid length)
> 4. Treats this as a failure, tries next DataNode
> 5. All DataNodes return -1
> 6. Throws CannotObtainBlockLengthException
> The problem persists because:
> - Lease is still held by the original client (output stream still open)
> - Client is still alive (from HDFS's perspective)
> - Automatic lease recovery only triggers when lease holder is detected as
> dead
> - No mechanism to automatically recover in this scenario
> h2. *DIAGNOSTIC EVIDENCE:*
> BEFORE DataNode Restart:
> - File under construction: true
> - Block length: 6 bytes
> - Block is complete: false
> - DataNode replica visible length: 6 ✅ READABLE
> AFTER DataNode Restart:
> - File under construction: true
> - Block length: 6 bytes
> - Block is complete: false
> - DataNode replica visible length: -1 ❌ UNREADABLE!
> AFTER Explicit Lease Recovery:
> - File under construction: false
> - Block length: 6 bytes
> - Block is complete: true
> - DataNode replica visible length: 6 ✅ READABLE AGAIN!
> h2. WHY THIS IS A BUG (NOT EXPECTED BEHAVIOR):
> HDFS Guarantees:
> - hflush() ensures data is visible to new readers
> - Under-construction files should be readable after hflush()
> - This is the documented contract for hflush()
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]