[
https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manoj Govindassamy updated HDFS-10819:
--------------------------------------
Attachment: HDFS-10819.001.patch
*Problem:*
- BlockManager reports an incorrect replica count for a file block even after the block has been successfully replicated to all target datanodes.
- TestDataNodeHotSwapVolumes fails with a "TimeoutException: Timed out waiting for /test to reach 3 replicas" error.
*Analysis:*
- Client wrote data to DN1 as part of the initial write pipeline DN1 -> DN2 -> DN3.
- DN1 persisted the block BLK_xyz_001 (say in storage volume *S1*), mirrored the block to the downstream nodes and was waiting for the ack back.
- Later, one of the storage volumes in DN1 (say S2) was removed. The client detected the pipeline issue, triggered pipeline recovery and got the new write pipeline DN2 -> DN3.
- On a successful {{FSNamesystem::updatePipeline}} request from the client, the NameNode bumped up the Generation Stamp (from 001 to 002) of the UnderConstruction (that is, the last) block of the file (a short sketch of the block ID vs. generation stamp relationship follows the code snippet below).
- Client wrote the block, now BLK_xyz_002, to the new write pipeline nodes (DN2 and DN3).
- Client closed the file stream. The NameNode ran the low-redundancy check for all the blocks in the file and detected that block BLK_xyz had only 2 replicas vs. the expected 3.
- The NameNode asked DN2 to replicate BLK_xyz_002 to DN1. Say DN1 persisted BLK_xyz_002 onto storage volume *S1* again.
- DN1 then sent an incremental block report (IBR) to the NameNode with the RECEIVED_BLOCK info about BLK_xyz_002 on *S1*.
- BlockManager processed the incremental block report from DN1 and tried to store the block metadata for BLK_xyz_002 against DN1 storage *S1*.
- But DN1 *S1* already had BLK_xyz_001, which was later marked corrupt as part of the pipeline update. The condition at line 2878 therefore evaluated false and the {{addStoredBlock}} call was skipped.
- So, when a storage has reported a corrupt block and the same storage later reports a good replica of that block, BlockManager fails to update the block -> datanode mapping and to prune the neededReconstruction list. Refer: {{BlockManager::addStoredBlockUnderConstruction}} below, which guards the {{addStoredBlock}} call.
{noformat}
2871   void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
2872       DatanodeStorageInfo storageInfo) throws IOException {
2873     BlockInfo block = ucBlock.storedBlock;
2874     block.getUnderConstructionFeature().addReplicaIfNotPresent(
2875         storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);
2876
2877     if (ucBlock.reportedState == ReplicaState.FINALIZED &&
2878         (block.findStorageInfo(storageInfo) < 0)) {
2879       addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
2880     }
2881   }
{noformat}
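For context on the BLK_xyz_001 / BLK_xyz_002 naming used above: both refer to the same block ID, only the generation stamp is bumped (001 -> 002) during pipeline recovery. A minimal, illustrative sketch using {{org.apache.hadoop.hdfs.protocol.Block}} (the block ID and length values below are made up, not taken from the test run):
{noformat}
import org.apache.hadoop.hdfs.protocol.Block;

public class GenStampSketch {
  public static void main(String[] args) {
    // The replica DN1 first persisted on S1: generation stamp "001".
    Block beforeRecovery = new Block(1073741825L, 1024L, 1L);
    // The same block after the pipeline update bumped the stamp to "002".
    Block afterRecovery  = new Block(1073741825L, 1024L, 2L);

    // Same block ID, different generation stamps: this is why DN1 storage S1
    // can still hold the stale, corrupt-marked 001 replica while the NameNode
    // expects a 002 replica to be reported from the very same storage.
    System.out.println(beforeRecovery.getBlockId() == afterRecovery.getBlockId());                   // true
    System.out.println(beforeRecovery.getGenerationStamp() == afterRecovery.getGenerationStamp());   // false
  }
}
{noformat}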
- The Replication Monitor, which runs continuously, tried to reconstruct the block on DN1, but {{BlockPlacementPolicyDefault}} failed to choose a target:
{noformat}
1148 2016-08-25 18:21:19,853 [ReplicationMonitor] WARN net.NetworkTopology
(NetworkTopology.java:chooseRandom(816)) - Failed to find datanode (scope=""
excludedScope="/default-rack").
1149 2016-08-25 18:21:19,853 [ReplicationMonitor] WARN
blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough
replicas, still in need of 1 to reach 3 (unavailableStorages=[],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more
information, please enable DEBUG log level on
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
1150 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN net.NetworkTopology
(NetworkTopology.java:chooseRandom(816)) - Failed to find datanode (scope=""
excludedScope="/default-rack").
1151 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN
blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough
replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more
information, please enable DEBUG log level on
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
1152 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN
protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) -
Failed to place enough replicas: expected size is 1 but only 0 storage types
can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE],
removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
1153 2016-08-25 18:21:19,854 [ReplicationMonitor] WARN
blockmanagement.BlockPlacementPolicy
(BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough
replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All
required storage types are unavailable: unavailableStorages=[DISK, ARCHIVE],
storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK],
creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
1154 2016-08-25 18:21:19,854 [ReplicationMonitor] DEBUG BlockStateChange
(BlockManager.java:computeReconstructionWorkForBlocks(1680)) - BLOCK*
neededReconstruction = 1 pendingReconstruction = 0
{noformat}
*Fix:*
- {{BlockManager::addStoredBlockUnderConstruction}} should not gate the call to {{BlockManager::addStoredBlock}} on the block -> datanode storage mapping check (see the sketch after this list).
- {{BlockManager::addStoredBlock}} already handles the block addition/replacement/already-exists cases, and, more importantly, it also prunes the {{LowRedundancyBlocks}} list.
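To make the intent concrete, here is a minimal sketch of the change (the attached HDFS-10819.001.patch is authoritative; this only illustrates dropping the storage-mapping guard for FINALIZED replicas):
{noformat}
  void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
      DatanodeStorageInfo storageInfo) throws IOException {
    BlockInfo block = ucBlock.storedBlock;
    block.getUnderConstructionFeature().addReplicaIfNotPresent(
        storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);

    if (ucBlock.reportedState == ReplicaState.FINALIZED) {
      // No block.findStorageInfo(storageInfo) < 0 guard any more: even if this
      // storage already maps an older, corrupt-marked replica, addStoredBlock()
      // handles the replacement and prunes the neededReconstruction bookkeeping.
      addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
    }
  }
{noformat}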
The attached patch has the fix. It also updates the unit test TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWrittenForDatanode to expose race conditions, which helped recreate the above problem frequently. With the proposed fix, BlockManager handles the case properly and the test passes.
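For reference, the replica-count wait that produces the timeout in the Problem section looks roughly like the following fragment (a hedged sketch, not copied from the actual test; the helper name and the use of a MiniDFSCluster variable are illustrative):
{noformat}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Illustrative helper: blocks until every block of /test has 3 replicas.
// DFSTestUtil.waitReplication() is what eventually throws
// "TimeoutException: Timed out waiting for /test to reach 3 replicas"
// when the NameNode never manages to place the missing replica.
static void waitForFullReplication(MiniDFSCluster cluster) throws Exception {
  FileSystem fs = cluster.getFileSystem();
  DFSTestUtil.waitReplication(fs, new Path("/test"), (short) 3);
}
{noformat}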
> BlockManager fails to store a good block for a datanode storage after it
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10819
> URL: https://issues.apache.org/jira/browse/HDFS-10819
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
> Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test
> testRemoveVolumeBeingWrittenForDatanode. The data write pipeline can run into
> issues such as timeouts or an unreachable datanode; in this test case the
> failure is an induced one, as one of the volumes in a datanode is removed
> while a block write is in progress. Digging further into the logs, when the
> problem happens in the write pipeline, the error recovery does not happen as
> expected, leaving block replication unable to ever catch up.
> Though this problem has the same signature as HDFS-10780, from the logs it
> looks like the code paths taken are totally different and so the root cause
> could be different as well.