[
https://issues.apache.org/jira/browse/HDFS-16804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626888#comment-17626888
]
ASF GitHub Bot commented on HDFS-16804:
---------------------------------------
DaveTeng0 commented on code in PR #5033:
URL: https://github.com/apache/hadoop/pull/5033#discussion_r1009978715
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/ReplicaMap.java:
##########
@@ -166,25 +167,24 @@ void addAll(ReplicaMap other) {
/**
* Merge all entries from the given replica map into the local replica map.
*/
- void mergeAll(ReplicaMap other) {
+ void mergeAll(ReplicaMap other) throws IOException {
Set<String> bplist = other.map.keySet();
for (String bp : bplist) {
checkBlockPool(bp);
try (AutoCloseDataSetLock l =
lockManager.writeLock(LockLevel.BLOCK_POOl, bp)) {
LightWeightResizableGSet<Block, ReplicaInfo> replicaInfos =
other.map.get(bp);
LightWeightResizableGSet<Block, ReplicaInfo> curSet = map.get(bp);
+ if (curSet == null) {
+ // Can't find the block pool id in the replicaMap. Maybe it has been removed.
Review Comment:
Just for my own learning: is it possible that we can't find the block pool id
in the map, but the block pool has not been removed yet? (Say, if something
went wrong during the deletion and the pool was not removed successfully?)
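For context, here is a minimal sketch of the kind of guard the diff above introduces. It is a toy model, not the real ReplicaMap: the class name GuardedMergeSketch, the plain HashMap, and the use of `synchronized` in place of lockManager.writeLock are all assumptions made for illustration. The diff excerpt is cut off right after the comment, so throwing an IOException in the null branch is a guess based on the new `throws IOException` in the signature.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for ReplicaMap: block pool id -> block ids. Hypothetical. */
class GuardedMergeSketch {
  private final Map<String, Set<Long>> map = new HashMap<>();

  /** Registers a block pool so that replicas can be merged into it. */
  synchronized void initBlockPool(String bpid) {
    map.putIfAbsent(bpid, new HashSet<>());
  }

  /** Models the cleanup step of a block pool shutdown. */
  synchronized void cleanUpBlockPool(String bpid) {
    map.remove(bpid);
  }

  synchronized boolean containsBlockPool(String bpid) {
    return map.containsKey(bpid);
  }

  /** Number of blocks tracked for the pool, or -1 if the pool is unknown. */
  synchronized int size(String bpid) {
    Set<Long> blocks = map.get(bpid);
    return blocks == null ? -1 : blocks.size();
  }

  /**
   * Merges entries from another map. A pool that is missing locally is
   * never re-created; the merge fails instead (assumed behavior).
   */
  synchronized void mergeAll(Map<String, Set<Long>> other) throws IOException {
    for (Map.Entry<String, Set<Long>> e : other.entrySet()) {
      Set<Long> curSet = map.get(e.getKey());
      if (curSet == null) {
        // Assumption: the real patch reports this case to the caller.
        throw new IOException("Can't find block pool id " + e.getKey()
            + " in the replica map. Maybe it has been removed.");
      }
      curSet.addAll(e.getValue());
    }
  }
}
```

In this model a missing entry only tells the caller that the map no longer tracks the pool. It says nothing about whether the rest of the shutdown completed, which is the case the question above asks about.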
> AddVolume contains a race condition with shutdown block pool
> ------------------------------------------------------------
>
> Key: HDFS-16804
> URL: https://issues.apache.org/jira/browse/HDFS-16804
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
>
> AddVolume contains a race condition with the block pool shutdown, which
> leaves blocks belonging to the removed block pool in the ReplicaMap.
> The new volume also retains one unused BlockPoolSlice belonging to the
> removed block pool, which causes problems such as incorrect dfsUsed and
> incorrect numBlocks for the volume.
> Let's review the logic of addVolume and shutdownBlockPool respectively.
>
> AddVolume Logic:
> * Step1: Get all namespaceInfo from blockPoolManager
> * Step2: Create one temporary FsVolumeImpl object
> * Step3: Create some blockPoolSlice according to the namespaceInfo and add
> them to the temporary FsVolumeImpl object
> * Step4: Scan all blocks of the namespaceInfo from the volume and store them
> in one temporary ReplicaMap
> * Step5: Activate the temporary FsVolumeImpl created before (with the
> FsDatasetImpl synchronized lock)
> ** Step5.1: Merge all blocks of the temporary ReplicaMap into the global
> ReplicaMap
> ** Step5.2: Add the FsVolumeImpl to the volumes
> ShutdownBlockPool Logic (with the blockPool write lock):
> * Step1: Clean up the blockPool from the global ReplicaMap
> * Step2: Shut down the block pool on all the volumes
> ** Step2.1: Perform cleanup operations for the block pool, such as
> saveReplica, saveDfsUsed, etc.
> ** Step2.2: remove the blockPool from bpSlices
> The race condition can be reproduced by the following steps:
> * AddVolume Step1: Get all namespaceInfo from blockPoolManager
> * ShutdownBlockPool Step1: Cleanup the blockPool from the global ReplicaMap
> * ShutdownBlockPool Step2: Shutdown the block pool from all the volumes
> * AddVolume Step 2~5
> Result:
> * The global replicaMap contains some blocks belonging to the removed
> blockPool
> * The bpSlices of the FsVolumeImpl contains one blockPoolSlice belonging to
> the removed blockPool
> Expected result:
> * The global replicaMap shouldn't contain any blocks belonging to the
> removed blockPool
> * The bpSlices of any FsVolumeImpl shouldn't contain any blockPoolSlice
> belonging to the removed blockPool
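The interleaving described in the issue can be replayed deterministically in a small model. The sketch below is illustrative only: AddVolumeRaceSketch, its fields, and the block ids are hypothetical stand-ins, and the two operations are run one after another on a single thread in the problematic order rather than on real concurrent threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the AddVolume / ShutdownBlockPool interleaving. */
class AddVolumeRaceSketch {
  // Stand-in for the global ReplicaMap: block pool id -> block ids.
  final Map<String, Set<Long>> globalReplicaMap = new HashMap<>();
  // Stand-in for the block pools known to the blockPoolManager.
  final Set<String> activeBlockPools = new HashSet<>();

  /** AddVolume Step1: snapshot the block pools known right now. */
  List<String> snapshotBlockPools() {
    return new ArrayList<>(activeBlockPools);
  }

  /** ShutdownBlockPool Step1 and Step2, collapsed into one call. */
  void shutdownBlockPool(String bpid) {
    globalReplicaMap.remove(bpid);
    activeBlockPools.remove(bpid);
  }

  /** AddVolume Step5.1 without a guard: re-creates missing pools. */
  void mergeUnguarded(Map<String, Set<Long>> scanned) {
    for (Map.Entry<String, Set<Long>> e : scanned.entrySet()) {
      globalReplicaMap
          .computeIfAbsent(e.getKey(), k -> new HashSet<>())
          .addAll(e.getValue());
    }
  }

  /** Replays the steps in the racy order; returns the pools left behind. */
  static Set<String> replay() {
    AddVolumeRaceSketch dn = new AddVolumeRaceSketch();
    dn.activeBlockPools.add("BP-1");
    dn.globalReplicaMap.put("BP-1", new HashSet<>(Arrays.asList(1L)));

    List<String> snapshot = dn.snapshotBlockPools();   // AddVolume Step1
    dn.shutdownBlockPool("BP-1");                      // ShutdownBlockPool Step1-2

    // AddVolume Step2-4: scan the new volume using the stale snapshot.
    Map<String, Set<Long>> scanned = new HashMap<>();
    for (String bp : snapshot) {
      scanned.put(bp, new HashSet<>(Arrays.asList(100L, 101L)));
    }
    dn.mergeUnguarded(scanned);                        // AddVolume Step5.1

    return new HashSet<>(dn.globalReplicaMap.keySet());
  }
}
```

Calling AddVolumeRaceSketch.replay() returns a set that still contains "BP-1" even though the pool was shut down, which corresponds to the first item under "Result" above.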
--
This message was sent by Atlassian Jira
(v8.20.10#820010)