[
https://issues.apache.org/jira/browse/HDFS-17860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053745#comment-18053745
]
ASF GitHub Bot commented on HDFS-17860:
---------------------------------------
teamconfx opened a new pull request, #8202:
URL: https://github.com/apache/hadoop/pull/8202
<!--
Thanks for sending a pull request!
1. If this is your first time, please read our contributor guidelines:
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
2. Make sure your PR title starts with JIRA issue id, e.g.,
'HADOOP-17799. Your PR title ...'.
-->
### Description of PR
This PR fixes [HDFS-17860](https://issues.apache.org/jira/browse/HDFS-17860).
When the NameNode restarts, BlocksMap.close() sets `blocks = null`. Several
methods then access this field without a null check, causing an NPE.
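For context, the code involved looks roughly like this (a paraphrased sketch
of `BlocksMap` in 3.3.5, not an exact copy of the source):

```java
// BlocksMap (sketch): close() nulls the map, but readers never re-check it.
void close() {
  clear();
  blocks = null;                    // any later access dereferences null
}

int numNodes(Block b) {
  BlockInfo info = blocks.get(b);   // NPE once close() has run
  return info == null ? 0 : info.numNodes();
}
```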
### For code changes:
- [x] Does the title or this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
### AI Tooling
If an AI tool was used:
- [ ] The PR includes the phrase "Contains content generated by <tool>"
where <tool> is the name of the AI tool used.
- [ ] My use of AI contributions follows the ASF legal policy
https://www.apache.org/legal/generative-tooling.html
> NPE in BlocksMap After NameNode Concat and Restart
> --------------------------------------------------
>
> Key: HDFS-17860
> URL: https://issues.apache.org/jira/browse/HDFS-17860
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.3.5
> Reporter: rstest
> Assignee: rstest
> Priority: Critical
> Labels: BlocksMap, HDFS
> Attachments: reproduce.sh, restart.patch
>
>
> h2. Overview
> A NullPointerException occurs in `BlocksMap.numNodes()` when calling
> `getBlockLocations()` on a file that was created via the `concat()`
> operation, after a NameNode restart. This is a critical production bug that
> could crash the NameNode during normal file read operations.
> h2. Reproduction
> h3. Quick Start with Bash Script
> Use the provided `reproduce.sh` script to reproduce this bug automatically.
> It handles the entire process: cloning Hadoop 3.3.5, applying the test
> patch, building the project, and running the failing test case.
>
> {code:java}
> # Make sure you download both reproduce.sh and restart.patch
> $ ./reproduce.sh{code}
>
> *What the script does:*
> 1. Clones the Hadoop repository (release 3.3.5 branch)
> 2. Applies the test patch (`restart.patch`) that adds the reproduction test
> 3. Builds the Hadoop HDFS module
> 4. Runs the test case `TestHDFSConcat#testConcatWithRestart` which
> demonstrates the NullPointerException
>
> The bug is confirmed if the test fails with a `NullPointerException` in
> `BlocksMap.numNodes()`.
> *Manual Reproduction Steps:*
> If you prefer to run the test manually:
> {code:java}
> mvn surefire:test \
>   -Dtest=TestHDFSConcat_RestartInjected#testConcat_AfterConcat_NN_Crash{code}
> *Test Scenario* (a code sketch of these steps follows the list)
> 1. Create a target file (`/trg`) with 3 blocks (512 bytes each)
> 2. Create 10 source files, each with 3 blocks
> 3. Call `dfs.concat(trgPath, files)` to concatenate all source files into
> target
> 4. Restart the NameNode
> 5. Call `nn.getBlockLocations(trg, 0, trgLen)` on the concatenated file
> 6. NPE occurs at `BlocksMap.numNodes()`
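>
> A condensed sketch of these steps as test code (names, sizes, and the
> cluster/filesystem handles are illustrative assumptions; the attached
> restart.patch is authoritative):
> {code:java}
> // Assumes a running MiniDFSCluster `cluster` and DistributedFileSystem
> // `dfs` configured with a 512-byte block size.
> Path trg = new Path("/trg");
> DFSTestUtil.createFile(dfs, trg, 3 * 512, (short) 1, 0L);        // step 1
> Path[] srcs = new Path[10];
> for (int i = 0; i < srcs.length; i++) {                          // step 2
>   srcs[i] = new Path("/src" + i);
>   DFSTestUtil.createFile(dfs, srcs[i], 3 * 512, (short) 1, 0L);
> }
> dfs.concat(trg, srcs);                                           // step 3
> cluster.restartNameNode();                                       // step 4
> long trgLen = dfs.getFileStatus(trg).getLen();
> // steps 5-6: the NPE surfaces from BlocksMap.numNodes()
> cluster.getNameNodeRpc().getBlockLocations(trg.toString(), 0, trgLen);{code}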
>
> *Stack Trace*
> {code:java}
> java.lang.NullPointerException
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlocksMap.numNodes(BlocksMap.java:172)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1420)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:1382)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:1353)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1503)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:179)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769){code}
> h2. Root Cause Analysis
> The root cause is the NameNode restart performed in the test: restarting the
> NN calls BlocksMap.close(), which sets `blocks = null`, and subsequent
> accesses dereference the null field.
>
> One improvement is a defense-in-depth approach: instead of surfacing a bare
> NPE, throw an informative IllegalStateException("BlocksMap closed").
> Here are all the methods that access the `blocks` field:
>
> ||Method||Access||Status||
> |close()|blocks = null|Safe (intentional)|
> |clear()|blocks != null check|Safe|
> |addBlockCollection()|blocks.get(), blocks.put()|UNSAFE|
> |removeBlock(BlockInfo)|blocks.remove()|UNSAFE|
> |getStoredBlock()|blocks.get()|UNSAFE|
> |getStorages(Block)|blocks.get()|UNSAFE|
> |numNodes()|blocks.get()|UNSAFE|
> |removeNode()|blocks.get(), blocks.remove()|UNSAFE|
> |size()|blocks != null check|Safe|
> |getBlocks()|return blocks|UNSAFE|
>
> The cleanest approach is to add a single private helper method that all
> other methods use:
> {code:java}
> /**
>  * Get the blocks map, throwing an informative exception if closed.
>  * This provides a better error message than a NullPointerException.
>  */
> private GSet<Block, BlockInfo> getBlocksInternal() {
>   GSet<Block, BlockInfo> b = blocks;
>   if (b == null) {
>     throw new IllegalStateException("BlocksMap has been closed");
>   }
>   return b;
> }{code}
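>
> Reading `blocks` into the local variable first means the null check and the
> return see the same reference, so the helper itself cannot NPE even if
> close() nulls the field concurrently.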
>
> Then update all unsafe methods to use getBlocksInternal() instead of blocks:
>
> {code:java}
> // Before (line 146)
> BlockInfo getStoredBlock(Block b) {
> return blocks.get(b);
> }
> // After
> BlockInfo getStoredBlock(Block b) {
> return getBlocksInternal().get(b);
> }
> // Before (line 171-173)
> int numNodes(Block b) {
> BlockInfo info = blocks.get(b);
> return info == null ? 0 : info.numNodes();
> }
> // After
> int numNodes(Block b) {
> BlockInfo info = getBlocksInternal().get(b);
> return info == null ? 0 : info.numNodes();
> }{code}
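>
> With the helper in place, a closed map fails fast with a clear message. A
> hypothetical JUnit sketch of the new failure mode (not part of the attached
> patch; it assumes a test in the same package as BlocksMap, so the
> package-private constructor and methods are visible):
> {code:java}
> BlocksMap map = new BlocksMap(16);
> map.close();
> try {
>   map.numNodes(new Block(1L));
>   fail("expected IllegalStateException");
> } catch (IllegalStateException expected) {
>   // an informative message instead of a bare NullPointerException
>   assertTrue(expected.getMessage().contains("closed"));
> }{code}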
>
> I'm more than happy to discuss the potential root cause and fix!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]