ConfX created HDFS-17863:
----------------------------

             Summary: CannotObtainBlockLengthException after DataNode restart
                 Key: HDFS-17863
                 URL: https://issues.apache.org/jira/browse/HDFS-17863
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode, dfs, hdfs-client
    Affects Versions: 3.3.5
         Environment: Hadoop 3.3.5

Java 8

Maven 3.6.3
            Reporter: ConfX
         Attachments: reproduce.sh, restart.patch

h2. DESCRIPTION:

After hflush(), HDFS guarantees that written data becomes visible to readers,
even while the file remains under construction. This guarantee is BROKEN after
DataNode restart. Under-construction blocks that have been flushed become
inaccessible (visible length = -1) until explicit lease recovery, causing
CannotObtainBlockLengthException when clients try to read the file.

This is a genuine production bug that affects:
- HBase WAL recovery after DataNode failures
- Streaming applications that write and read simultaneously
- Any application relying on hflush() visibility guarantees
h2. STEPS TO REPRODUCE:

Download the attached {{reproduce.sh}} and {{restart.patch}}, then run:

{code:bash}
$ bash reproduce.sh
{code}

The script:
1. Clones the Hadoop repository (release 3.3.5 branch)
2. Applies the test patch ({{restart.patch}}) that adds the reproduction test
3. Builds the Hadoop HDFS module
4. Runs the test case: TestBlockToken#testLastLocatedBlockTokenExpiryWithDataNodeRestart

*EXPECTED RESULT:*
The file should be readable after hflush(), even after a DataNode restart.

*ACTUAL RESULT:*
{code}
org.apache.hadoop.hdfs.CannotObtainBlockLengthException: Cannot obtain block length for LocatedBlock
{code}

The bug is confirmed if the test fails with CannotObtainBlockLengthException.

*KEY OBSERVATION:*
- Tests with NameNode-only restart (no DataNode restart) DO NOT fail
- The bug ONLY occurs when DataNode restarts
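
In test form, the failing scenario is essentially the following (a minimal sketch using MiniDFSCluster; the actual test added by {{restart.patch}} is TestBlockToken#testLastLocatedBlockTokenExpiryWithDataNodeRestart, and the path name here is illustrative):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

Configuration conf = new HdfsConfiguration();
MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
    .numDataNodes(1).build();
try {
  cluster.waitActive();
  DistributedFileSystem fs = cluster.getFileSystem();
  Path path = new Path("/uc-file");

  FSDataOutputStream out = fs.create(path);
  out.write("hello\n".getBytes(StandardCharsets.UTF_8));  // 6 bytes
  out.hflush();  // flushed data must now be visible to new readers

  cluster.restartDataNode(0, true);  // restart the DataNode, keep its port
  cluster.waitActive();

  // The writer still holds the lease and the replica is now in RWR state:
  // this open/read fails with CannotObtainBlockLengthException.
  try (FSDataInputStream in = fs.open(path)) {
    in.read();
  }
} finally {
  cluster.shutdown();
}
{code}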
h2. ROOT CAUSE:

When a DataNode restarts, under-construction block replicas are loaded from
disk and placed in the ReplicaWaitingToBeRecovered (RWR) state:

{code:java}
// ReplicaWaitingToBeRecovered.java:75
@Override
public long getVisibleLength() {
  return -1;  // no bytes are visible
}
{code}

This state explicitly returns -1 for visible length, meaning "no bytes visible"
until lease recovery completes.
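
On the DataNode side, getReplicaVisibleLength() simply forwards to the replica object, so an RWR replica reports -1 to every caller. Abridged from FsDatasetImpl#getReplicaVisibleLength (locking and generation-stamp validation omitted):

{code:java}
@Override // FsDatasetSpi
public long getReplicaVisibleLength(final ExtendedBlock block)
    throws IOException {
  final Replica replica = getReplicaInfo(block.getBlockPoolId(),
      block.getBlockId());
  return replica.getVisibleLength(); // -1 while the replica is in RWR
}
{code}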

When a client tries to open the file:
1. DFSInputStream calls readBlockLength() to determine UC block length
2. Contacts DataNode via getReplicaVisibleLength()
3. Receives -1 (not a valid length)
4. Treats this as a failure, tries next DataNode
5. All DataNodes return -1
6. Throws CannotObtainBlockLengthException
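
In code, the failing path looks roughly like this (abridged from DFSInputStream#readBlockLength in 3.3.x; proxy construction and logging trimmed):

{code:java}
// Ask each DataNode that holds a replica for the visible length;
// a negative answer means "no answer" and the next DataNode is tried.
for (DatanodeInfo datanode : locatedblock.getLocations()) {
  try {
    ClientDatanodeProtocol cdp = DFSUtilClient.createClientDatanodeProtocolProxy(
        datanode, dfsClient.getConfiguration(),
        dfsClient.getConf().getSocketTimeout(),
        dfsClient.getConf().isConnectToDnViaHostname(), locatedblock);
    long n = cdp.getReplicaVisibleLength(locatedblock.getBlock());
    if (n >= 0) {
      return n;  // a valid visible length was obtained
    }
  } catch (IOException ioe) {
    // unreachable DataNode or RPC failure: try the next one
  }
}
// After a DataNode restart every replica is in RWR and returns -1, so:
throw new CannotObtainBlockLengthException(locatedblock);
{code}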

The problem persists because:
- Lease is still held by the original client (output stream still open)
- Client is still alive (from HDFS's perspective)
- Automatic lease recovery only triggers when lease holder is detected as dead
- No mechanism to automatically recover in this scenario
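
Until this is fixed, a read can only be unblocked by forcing lease recovery manually, either with the {{hdfs debug recoverLease -path <file>}} CLI or via the public API. A minimal sketch (path name illustrative); note that this revokes the original writer's lease and closes the file:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

Configuration conf = new Configuration();
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
Path path = new Path("/uc-file");  // illustrative path
// recoverLease() is asynchronous: poll until it reports the file closed.
while (!dfs.recoverLease(path)) {
  Thread.sleep(1000);
}
{code}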
h2. DIAGNOSTIC EVIDENCE:

||Phase||File under construction||Block length||Block complete||DataNode replica visible length||Readable||
|Before DataNode restart|true|6 bytes|false|6|✅ readable|
|After DataNode restart|true|6 bytes|false|-1|❌ unreadable|
|After explicit lease recovery|false|6 bytes|true|6|✅ readable again|
h2. WHY THIS IS A BUG (NOT EXPECTED BEHAVIOR):

HDFS Guarantees:
- hflush() ensures data is visible to new readers
- Under-construction files should be readable after hflush()
- This is the documented contract for hflush()

Regression Evidence:
The original test (without restart) successfully:
1. Creates file
2. Writes data
3. Calls hflush()
4. Opens file for reading while output stream still open
5. Reads the flushed data

This proves HDFS explicitly supports reading UC files after hflush().
After DataNode restart, this capability is lost - a clear regression.
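
That supported pattern, in client code (a minimal sketch; the path is illustrative):

{code:java}
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/uc-file");

FSDataOutputStream out = fs.create(path);
out.write("hello\n".getBytes(StandardCharsets.UTF_8));
out.hflush();  // contract: new readers now see these bytes

// Open for read while the output stream is still open: this works
// today, as long as no DataNode restarts in between.
try (FSDataInputStream in = fs.open(path)) {
  byte[] buf = new byte[6];
  in.readFully(buf);  // reads the flushed data
}
out.close();
{code}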

Inconsistent Behavior:

||Component restarted||Result||
|NameNode only|File remains readable|
|DataNode only|File becomes unreadable|
|Both|File becomes unreadable|

This inconsistency indicates a bug, not intentional design.
h2. PROPOSED FIX OPTIONS:


*Option 1: Change RWR State Behavior*
------------------------------------
Modify ReplicaWaitingToBeRecovered to return the actual visible length based
on the bytes on disk, rather than -1:

{code:java}
@Override
public long getVisibleLength() {
  // Instead of: return -1;
  return getBytesOnDisk();  // return the actual flushed data length
}
{code}

Pros: Restores the hflush() visibility guarantee
Cons: May have implications for the lease recovery protocol; bytes on disk can
exceed the last acknowledged hflush, so readers could briefly see data that
block recovery later truncates

*Option 2: Automatic Lease Recovery on Read*
-------------------------------------------
When DFSInputStream encounters visible length = -1, automatically trigger
lease recovery:

{code:java}
// Sketch: in DFSInputStream, after every DataNode has returned -1
// for a block of a file that is still under construction.
if (n == -1 && locatedBlocks.isUnderConstruction()) {
  dfsClient.recoverLease(src);
  // then retry fetching the visible length
}
{code}

Pros: Transparent recovery
Cons: May interfere with active writers and cause lease conflicts

*Option 3: Better Replica State on Recovery*
-------------------------------------------
When the DataNode loads UC blocks from disk after a restart, determine whether
they should be placed in RWR or in a state that preserves the visible length:

{code:java}
// If the block has been flushed (the meta file shows visible length > 0),
// use ReplicaBeingWritten or a similar state instead of RWR.
{code}

Pros: Preserves visibility without changing the recovery protocol
Cons: Requires tracking flush state in metadata


