[ 
https://issues.apache.org/jira/browse/HDFS-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18049067#comment-18049067
 ] 

ConfX edited comment on HDFS-17860 at 1/21/26 10:00 PM:
--------------------------------------------------------

Ping --> Any thoughts?


was (Author: JIRAUSER296392):
It seems the root cause is that I restart the NN in the test method.

The NN restart calls BlocksMap.close(), which sets `blocks = null`.

 

One improvement is a defense-in-depth check that fails fast with an informative 
error instead of letting the NPE propagate, e.g. IllegalStateException("BlocksMap closed").

Here are all the methods that access the blocks field:

 
||Method||Access||Status||
|close()|blocks = null|Safe (intentional)|
|clear()|blocks != null check|Safe|
|addBlockCollection()|blocks.get(), blocks.put()|UNSAFE|
|removeBlock(BlockInfo)|blocks.remove()|UNSAFE|
|getStoredBlock()|blocks.get()|UNSAFE|
|getStorages(Block)|blocks.get()|UNSAFE|
|numNodes()|blocks.get()|UNSAFE|
|removeNode()|blocks.get(), blocks.remove()|UNSAFE|
|size()|blocks != null check|Safe|
|getBlocks()|return blocks|UNSAFE|

 

 The cleanest approach is to add a single private helper method that all other 
methods use:     
{code:java}
  /**
   * Get the blocks map, throwing an informative exception if closed.
   * This provides a better error message than a NullPointerException.
   */
  private GSet<Block, BlockInfo> getBlocksInternal() {
    GSet<Block, BlockInfo> b = blocks;
    if (b == null) {
      throw new IllegalStateException("BlocksMap has been closed.");
    }
    return b;
  }{code}
 

  Then update all unsafe methods to use getBlocksInternal() instead of blocks:

 
{code:java}
  // Before (line 146)
  BlockInfo getStoredBlock(Block b) {
    return blocks.get(b);
  }
  // After
  BlockInfo getStoredBlock(Block b) {
    return getBlocksInternal().get(b);
  }
  // Before (line 171-173)
  int numNodes(Block b) {
    BlockInfo info = blocks.get(b);
    return info == null ? 0 : info.numNodes();
  }
  // After
  int numNodes(Block b) {
    BlockInfo info = getBlocksInternal().get(b);
    return info == null ? 0 : info.numNodes();
  }
{code}
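Outside of Hadoop, the fail-fast pattern above can be illustrated with a self-contained sketch. The `ClosableMap` class below is a hypothetical stand-in (a plain HashMap plays the role of the GSet), not the real BlocksMap:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for BlocksMap: a HashMap plays the role of the GSet.
class ClosableMap {
  private Map<Long, String> blocks = new HashMap<>();

  void close() {
    blocks = null;  // mirrors BlocksMap.close() nulling the field
  }

  // The single private helper every accessor goes through.
  private Map<Long, String> getBlocksInternal() {
    Map<Long, String> b = blocks;
    if (b == null) {
      throw new IllegalStateException("BlocksMap has been closed");
    }
    return b;
  }

  String get(long id) {
    return getBlocksInternal().get(id);
  }

  void put(long id, String info) {
    getBlocksInternal().put(id, info);
  }
}

public class Demo {
  public static void main(String[] args) {
    ClosableMap m = new ClosableMap();
    m.put(1L, "blk_1");
    System.out.println(m.get(1L));
    m.close();
    try {
      m.get(1L);  // would have been an NPE without the helper
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Reading the field into a local variable first also avoids a time-of-check/time-of-use race if close() runs concurrently with an accessor.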

> NPE in BlocksMap After NameNode Concat and Restart
> --------------------------------------------------
>
>                 Key: HDFS-17860
>                 URL: https://issues.apache.org/jira/browse/HDFS-17860
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.3.5
>            Reporter: ConfX
>            Assignee: ConfX
>            Priority: Critical
>              Labels: BlocksMap, HDFS
>         Attachments: reproduce.sh, restart.patch
>
>
> h2. Overview 
> A NullPointerException occurs in `BlocksMap.numNodes()` when calling 
> `getBlockLocations()` on a file created via the `concat()` operation, after a 
> NameNode restart. This is a critical production bug that 
> could crash the NameNode during normal file read operations.
> h2. Reproduction
> h3. Quick Start with Bash Script
> Use the provided `reproduce.sh` script to automatically reproduce this bug:
> This script automates the entire reproduction process by cloning Hadoop 
> 3.3.5, applying the test patch, building the project, and running the failing 
> test case.
>  
> {code:java}
> # Make sure you download both reproduce.sh and restart.patch
> $ ./reproduce.sh{code}
>  
> *What the script does:*
> 1. Clones the Hadoop repository (release 3.3.5 branch)
> 2. Applies the test patch (`restart.patch`) that adds the reproduction test
> 3. Builds the Hadoop HDFS module
> 4. Runs the test case `TestHDFSConcat#testConcatWithRestart` which 
> demonstrates the NullPointerException
>  
> The bug is confirmed if the test fails with a `NullPointerException` in 
> `BlocksMap.numNodes()`.
> *Manual Reproduction Steps:*
> If you prefer to run the test manually:
> {code:java}
> mvn surefire:test 
> -Dtest=TestHDFSConcat_RestartInjected#testConcat_AfterConcat_NN_Crash{code}
> *Test Scenario*
> 1. Create a target file (`/trg`) with 3 blocks (512 bytes each)
> 2. Create 10 source files, each with 3 blocks
> 3. Call `dfs.concat(trgPath, files)` to concatenate all source files into 
> target
> 4. Restart the NameNode
> 5. Call `nn.getBlockLocations(trg, 0, trgLen)` on the concatenated file
> 6. NPE occurs at `BlocksMap.numNodes()`
>   
> *Stack Trace*
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.numNodes(BlocksMap.java:172)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1420)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1382)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:1353)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1503)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769){code}
> h2. Root Cause Analysis
> The root cause is that I restart the NN in the test method, and the NN 
> restart calls BlocksMap.close(), which sets `blocks = null`.
>  
> One improvement is a defense-in-depth check that fails fast with an 
> informative error instead of letting the NPE propagate, e.g. 
> IllegalStateException("BlocksMap closed").
> Here are all the methods that access the blocks field:
>  
> ||Method||Access||Status||
> |close()|blocks = null|Safe (intentional)|
> |clear()|blocks != null check|Safe|
> |addBlockCollection()|blocks.get(), blocks.put()|UNSAFE|
> |removeBlock(BlockInfo)|blocks.remove()|UNSAFE|
> |getStoredBlock()|blocks.get()|UNSAFE|
> |getStorages(Block)|blocks.get()|UNSAFE|
> |numNodes()|blocks.get()|UNSAFE|
> |removeNode()|blocks.get(), blocks.remove()|UNSAFE|
> |size()|blocks != null check|Safe|
> |getBlocks()|return blocks|UNSAFE|
>  
>  The cleanest approach is to add a single private helper method that all 
> other methods use:     
> {code:java}
>   /**
>    * Get the blocks map, throwing an informative exception if closed.
>    * This provides a better error message than a NullPointerException.
>    */
>   private GSet<Block, BlockInfo> getBlocksInternal() {
>     GSet<Block, BlockInfo> b = blocks;
>     if (b == null) {
>       throw new IllegalStateException("BlocksMap has been closed.");
>     }
>     return b;
>   }{code}
>  
>   Then update all unsafe methods to use getBlocksInternal() instead of blocks:
>  
> {code:java}
>   // Before (line 146)
>   BlockInfo getStoredBlock(Block b) {
>     return blocks.get(b);
>   }
>   // After
>   BlockInfo getStoredBlock(Block b) {
>     return getBlocksInternal().get(b);
>   }
>   // Before (line 171-173)
>   int numNodes(Block b) {
>     BlockInfo info = blocks.get(b);
>     return info == null ? 0 : info.numNodes();
>   }
>   // After
>   int numNodes(Block b) {
>     BlockInfo info = getBlocksInternal().get(b);
>     return info == null ? 0 : info.numNodes();
>   }{code}
>  
> I'm more than happy to discuss the potential root cause and fix!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
