kokonguyen191 opened a new pull request, #6759:
URL: https://github.com/apache/hadoop/pull/6759
### Description of PR
Error logs
```
2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830
heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode
(BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool
BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid
1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
java.lang.NullPointerException
at
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
at
org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
at
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
at java.lang.Thread.run(Thread.java:748)
```
The root cause is in `BPOfferService#notifyNamenodeBlock`, which can be called on a block belonging to a volume that was already removed. Because the volume is gone, the storage lookup returns null:
```java
private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
String delHint, String storageUuid, boolean isOnTransientStorage) {
checkBlock(block);
final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
block.getLocalBlock(), status, delHint);
final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
// storage == null here because it's already removed earlier.
for (BPServiceActor actor : bpServices) {
actor.getIbrManager().notifyNamenodeBlock(info, storage,
isOnTransientStorage);
}
}
```
so IBRs with a null storage end up pending, and sending them later triggers the NPE in the log above.
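The first guard the patch describes (skipping IBRs whose storage is gone) can be illustrated with a minimal, self-contained model of the race. All class, field, and method names here are hypothetical stand-ins, not the actual Hadoop code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: a block notification arrives after its volume was removed,
// so the storage lookup returns null. Without a guard, the null storage
// is queued and dereferenced later when the IBR is sent (the NPE above).
public class NullStorageGuard {
  static final Map<String, String> storages = new HashMap<>();
  static final List<String> pendingIbrs = new ArrayList<>();

  // Returns true if the IBR was queued, false if it was dropped
  // because the backing storage no longer exists.
  static boolean notifyNamenodeBlock(String blockId, String storageUuid) {
    String storage = storages.get(storageUuid);
    if (storage == null) {
      // Guard: drop the notification instead of queueing a null storage.
      return false;
    }
    pendingIbrs.add(blockId + "@" + storage);
    return true;
  }

  public static void main(String[] args) {
    storages.put("s1", "DS-s1");
    if (!notifyNamenodeBlock("blk_1", "s1")) {
      throw new AssertionError("storage present, should be queued");
    }
    storages.remove("s1"); // volume removed
    if (notifyNamenodeBlock("blk_2", "s1")) {
      throw new AssertionError("storage removed, should be dropped");
    }
  }
}
```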
The reason `notifyNamenodeBlock` can be triggered for such blocks lies higher up, in `DirectoryScanner#reconcile`:
```java
public void reconcile() throws IOException {
LOG.debug("reconcile start DirectoryScanning");
scan();
// If a volume is removed here after scan() already finished running,
// diffs is stale and checkAndUpdate will run on a removed volume
    // HDFS-14476: run checkAndUpdate with batch to avoid holding the
    // lock too long
int loopCount = 0;
synchronized (diffs) {
for (final Map.Entry<String, ScanInfo> entry : diffs.getEntries()) {
dataset.checkAndUpdate(entry.getKey(), entry.getValue());
...
}
```
Inside `checkAndUpdate`, `memBlockInfo` is null because all in-memory block metadata was removed during the volume removal, but `diskFile` still exists on disk. `DataNode#notifyNamenodeDeletedBlock` (and, further down the line, `notifyNamenodeBlock`) is then called for this block.
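The second guard (skipping stale scan entries whose volume was removed between `scan()` and `checkAndUpdate`) can be sketched the same way. Again, every name here is an illustrative stand-in, not the actual `FsDatasetImpl` code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model: a diff entry computed by scan() becomes stale when its
// volume is removed before checkAndUpdate() runs. The guard skips such
// entries instead of reconciling a disk file with no in-memory block.
public class StaleScanGuard {
  static final Set<String> liveVolumes = new HashSet<>();

  // Returns true if the scan entry should be processed.
  static boolean checkAndUpdate(String volumeId, String diskFile) {
    if (!liveVolumes.contains(volumeId)) {
      // Stale entry: the volume was removed after scan(), so the
      // in-memory block info is gone even though diskFile still exists.
      return false;
    }
    // ... normal reconciliation would happen here ...
    return true;
  }

  public static void main(String[] args) {
    liveVolumes.add("vol-1");
    if (!checkAndUpdate("vol-1", "blk_1")) {
      throw new AssertionError("live volume, should be processed");
    }
    liveVolumes.remove("vol-1"); // volume removed mid-reconcile
    if (checkAndUpdate("vol-1", "blk_1")) {
      throw new AssertionError("removed volume, should be skipped");
    }
  }
}
```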
This patch effectively contains two parts:
* Preventing IBRs with a null storage from being processed
* Preventing `FsDatasetImpl#checkAndUpdate` from running on a block belonging to a removed storage
### How was this patch tested?
Unit tests. Partially on production cluster.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]