[ https://issues.apache.org/jira/browse/HDFS-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053772#comment-18053772 ]

ASF GitHub Bot commented on HDFS-17860:
---------------------------------------

hadoop-yetus commented on PR #8202:
URL: https://github.com/apache/hadoop/pull/8202#issuecomment-3788329772

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m  8s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  |  The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  44m 12s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 52s |  |  trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 47s |  |  trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  checkstyle  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 57s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  spotbugs  |   4m 12s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  30m 58s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 46s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  0s |  |  the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  spotbugs  |   3m 44s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  29m 37s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 217m  0s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8202/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 48s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 348m 38s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.52 ServerAPI=1.52 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8202/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/8202 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux a68202c9a11b 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 52103011582b6c063af54dd880c45e02af29ff3c |
   | Default Java | Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
   | Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8202/1/testReport/ |
   | Max. process+thread count | 3489 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8202/1/console |
   | versions | git=2.25.1 maven=3.9.11 spotbugs=4.9.7 |
   | Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> NPE in BlocksMap After NameNode Concat and Restart
> --------------------------------------------------
>
>                 Key: HDFS-17860
>                 URL: https://issues.apache.org/jira/browse/HDFS-17860
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.3.5
>            Reporter: rstest
>            Assignee: rstest
>            Priority: Critical
>              Labels: BlocksMap, HDFS, pull-request-available
>         Attachments: reproduce.sh, restart.patch
>
>
> h2. Overview 
> A NullPointerException occurs in `BlocksMap.numNodes()` when `getBlockLocations()` 
> is called, after a NameNode restart, on a file that was the target of a 
> `concat()` operation. This is a critical production bug that could crash the 
> NameNode during normal file read operations.
> h2. Reproduction
> h3. Quick Start with Bash Script
> Use the provided `reproduce.sh` script to reproduce this bug automatically.
> The script automates the entire reproduction process: it clones Hadoop 3.3.5, 
> applies the test patch, builds the project, and runs the failing test case.
>  
> {code:bash}
> # Make sure you download both reproduce.sh and restart.patch
> $ ./reproduce.sh{code}
>  
> *What the script does:*
> 1. Clones the Hadoop repository (release 3.3.5 branch)
> 2. Applies the test patch (`restart.patch`) that adds the reproduction test
> 3. Builds the Hadoop HDFS module
> 4. Runs the test case `TestHDFSConcat#testConcatWithRestart`, which 
> demonstrates the NullPointerException
>  
> The bug is confirmed if the test fails with a `NullPointerException` in 
> `BlocksMap.numNodes()`.
> *Manual Reproduction Steps:*
> If you prefer to run the test manually:
> {code:bash}
> mvn surefire:test -Dtest=TestHDFSConcat_RestartInjected#testConcat_AfterConcat_NN_Crash{code}
> *Test Scenario*
> 1. Create a target file (`/trg`) with 3 blocks (512 bytes each)
> 2. Create 10 source files, each with 3 blocks
> 3. Call `dfs.concat(trgPath, files)` to concatenate all source files into 
> target
> 4. Restart the NameNode
> 5. Call `nn.getBlockLocations(trg, 0, trgLen)` on the concatenated file
> 6. NPE occurs at `BlocksMap.numNodes()`
>   
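> The scenario above can be sketched as a MiniDFSCluster-based test. This is an 
> illustrative sketch only, using the standard HDFS test utilities and an assumed 
> 512-byte block size; it is not the exact test added by `restart.patch`:
> {code:java}
> // Sketch of the reproduction scenario (illustrative, not the patch's test).
> Configuration conf = new HdfsConfiguration();
> conf.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, 512);          // 512-byte blocks
> MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build();
> try {
>   DistributedFileSystem dfs = cluster.getFileSystem();
>   Path trg = new Path("/trg");
>   DFSTestUtil.createFile(dfs, trg, 3 * 512, (short) 1, 0L);   // target: 3 blocks
>   Path[] srcs = new Path[10];
>   for (int i = 0; i < srcs.length; i++) {                     // 10 sources, 3 blocks each
>     srcs[i] = new Path("/src" + i);
>     DFSTestUtil.createFile(dfs, srcs[i], 3 * 512, (short) 1, 0L);
>   }
>   dfs.concat(trg, srcs);                                      // concat sources into /trg
>   cluster.restartNameNode(true);                               // restart the NameNode
>   long trgLen = dfs.getFileStatus(trg).getLen();
>   // Triggers BlockManager.createLocatedBlock() -> BlocksMap.numNodes()
>   cluster.getNameNodeRpc().getBlockLocations(trg.toString(), 0, trgLen);
> } finally {
>   cluster.shutdown();
> }{code}
> On an affected build, the final `getBlockLocations` call fails with the stack trace below.
> 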
> *Stack Trace*
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.numNodes(BlocksMap.java:172)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1420)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1382)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:1353)
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1503)
> at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769){code}
> h2. Root Cause Analysis
> The root cause is that the test method restarts the NameNode: the restart calls 
> `BlocksMap.close()`, which sets `blocks = null`, so any subsequent access to the 
> map dereferences null.
>  
> One improvement is a defense-in-depth check: instead of surfacing a bare NPE, 
> throw an `IllegalStateException` with an informative message such as "BlocksMap closed".
> Here are all the methods that access the `blocks` field:
>  
> ||Method||Access||Status||
> |close()|blocks = null|Safe (intentional)|
> |clear()|blocks != null check|Safe|
> |addBlockCollection()|blocks.get(), blocks.put()|UNSAFE|
> |removeBlock(BlockInfo)|blocks.remove()|UNSAFE|
> |getStoredBlock()|blocks.get()|UNSAFE|
> |getStorages(Block)|blocks.get()|UNSAFE|
> |numNodes()|blocks.get()|UNSAFE|
> |removeNode()|blocks.get(), blocks.remove()|UNSAFE|
> |size()|blocks != null check|Safe|
> |getBlocks()|return blocks|UNSAFE|
>  
>  The cleanest approach is to add a single private helper method that all 
> other methods use:     
> {code:java}
>   /**
>    * Get the blocks map, throwing an informative exception if closed.
>    * This provides a better error message than a NullPointerException.
>    */
>   private GSet<Block, BlockInfo> getBlocksInternal() {
>     GSet<Block, BlockInfo> b = blocks;
>     if (b == null) {
>       throw new IllegalStateException("BlocksMap has been closed.");
>     }
>     return b;
>   }{code}
>  
>   Then update all unsafe methods to use getBlocksInternal() instead of blocks:
>  
> {code:java}
>   // Before (line 146)
>   BlockInfo getStoredBlock(Block b) {
>     return blocks.get(b);
>   }
>   // After
>   BlockInfo getStoredBlock(Block b) {
>     return getBlocksInternal().get(b);
>   }
>   // Before (line 171-173)
>   int numNodes(Block b) {
>     BlockInfo info = blocks.get(b);
>     return info == null ? 0 : info.numNodes();
>   }
>   // After
>   int numNodes(Block b) {
>     BlockInfo info = getBlocksInternal().get(b);
>     return info == null ? 0 : info.numNodes();
>   }{code}
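> 
> A regression test for the new behavior could look roughly like the sketch below. 
> It is hypothetical (it assumes the package-private `BlocksMap(int capacity)` 
> constructor and a test class in the `org.apache.hadoop.hdfs.server.blockmanagement` 
> package) and is not part of the patch:
> {code:java}
> // Hypothetical test sketch, not part of the patch.
> @Test
> public void testAccessAfterCloseThrowsInformativeException() {
>   BlocksMap map = new BlocksMap(
>       LightWeightGSet.computeCapacity(2.0, "BlocksMap"));
>   map.close();
>   try {
>     map.numNodes(new Block(1L));
>     fail("Expected IllegalStateException after close()");
>   } catch (IllegalStateException expected) {
>     // Informative failure instead of a bare NullPointerException.
>   }
> }{code}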
>  
> I'm more than happy to discuss the potential root cause and fix!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
