[ 
https://issues.apache.org/jira/browse/HDFS-10819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HDFS-10819:
--------------------------------------
    Attachment: HDFS-10819.001.patch

*Problem:*
- BlockManager reports an incorrect replica count for a file block even after 
successful replication to all replicas.
- TestDataNodeHotSwapVolumes fails with a "TimeoutException: Timed out waiting 
for /test to reach 3 replicas" error.

*Analysis:*
- The client wrote data to DN1 as part of the initial write pipeline DN1 -> 
DN2 -> DN3.
- DN1 persisted the block BLK_xyz_001 (say, in storage volume *S1*), mirrored 
the block to its downstreams and was waiting for the ack back.
- Later, one of the storage volumes in DN1 (say, S2) was removed. The client 
detected the pipeline issue, triggered pipeline recovery and got the new write 
pipeline DN2 -> DN3.
- On a successful {{FSNamesystem::updatePipeline}} request from the client, the 
NameNode bumped up the generation stamp (from 001 to 002) of the 
under-construction (that is, the last) block of the file.
- The client wrote the block with the new generation stamp, BLK_xyz_002, to the 
new write pipeline nodes (DN2 and DN3).
- The client closed the file stream. The NameNode ran the low-redundancy check 
for all the blocks in the file and detected that block BLK_xyz had a 
replication factor of 2 vs. the expected 3.
- The NameNode asked DN2 to replicate BLK_xyz_002 to DN1. Say DN1 persisted 
BLK_xyz_002 onto storage volume *S1* again.
- DN1 then sent an incremental block report (IBR) to the NameNode with the 
RECEIVED_BLOCK info about BLK_xyz_002 on *S1*.

- BlockManager processed the incremental block report from DN1 and tried to 
record the block BLK_xyz_002 for DN1 on storage *S1*.
- But DN1's S1 already had BLK_xyz_001, which had been marked corrupt as part 
of the earlier pipeline update. The check at line 2878 in the snippet below 
therefore failed.
- So, when a storage has had a corrupt block and the same storage later reports 
a good block, BlockManager fails to update the block -> datanode mapping and to 
prune the neededReconstruction list. Refer: {{BlockManager::addStoredBlock}}

{noformat}
  2871   void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
  2872       DatanodeStorageInfo storageInfo) throws IOException {
  2873     BlockInfo block = ucBlock.storedBlock;
  2874     block.getUnderConstructionFeature().addReplicaIfNotPresent(
  2875         storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);
  2876 
  2877     if (ucBlock.reportedState == ReplicaState.FINALIZED &&
  2878         (block.findStorageInfo(storageInfo) < 0)) {
  2879       addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
  2880     }
  2881   }
{noformat}
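
To make the failing condition concrete, here is a small, self-contained toy 
sketch ({{ToyStorage}} and {{ToyBlockInfo}} are hypothetical stand-ins, not the 
real BlockManager types). Because storage S1 is already recorded against the 
block from the earlier, now-corrupt BLK_xyz_001 replica, 
{{findStorageInfo(storageInfo) < 0}} evaluates to false when the good 
BLK_xyz_002 replica is reported, so {{addStoredBlock}} is never invoked:

{noformat}
// Toy illustration only -- ToyStorage/ToyBlockInfo are hypothetical stand-ins,
// not the real HDFS BlockManager classes.
import java.util.ArrayList;
import java.util.List;

class ToyStorage {
  final String id;
  ToyStorage(String id) { this.id = id; }
}

class ToyBlockInfo {
  // Storages already recorded as holding a replica of this block.
  private final List<ToyStorage> storages = new ArrayList<>();

  void recordReplica(ToyStorage s) { storages.add(s); }

  // Mirrors the idea of BlockInfo#findStorageInfo: index of the storage, or -1.
  int findStorageInfo(ToyStorage s) { return storages.indexOf(s); }
}

public class StaleMappingDemo {
  public static void main(String[] args) {
    ToyStorage s1 = new ToyStorage("DN1-S1");
    ToyBlockInfo block = new ToyBlockInfo();

    // S1 held BLK_xyz_001, so it is already mapped to the block, even though
    // that replica was later marked corrupt after the generation stamp bump.
    block.recordReplica(s1);

    // S1 now reports the re-replicated, FINALIZED BLK_xyz_002.
    boolean reportedFinalized = true;
    if (reportedFinalized && block.findStorageInfo(s1) < 0) {
      System.out.println("addStoredBlock() called -- mapping refreshed");
    } else {
      // This branch is taken: the stale mapping suppresses addStoredBlock(),
      // so neededReconstruction is never pruned and replication looks stuck.
      System.out.println("addStoredBlock() skipped -- block stays under-replicated");
    }
  }
}
{noformat}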

- The Replication Monitor, which runs continuously, kept trying to reconstruct 
the block on DN1, but {{BlockPlacementPolicyDefault}} failed to choose the same 
target:

{noformat}
2016-08-25 18:21:19,853 [ReplicationMonitor] WARN  net.NetworkTopology (NetworkTopology.java:chooseRandom(816)) - Failed to find datanode (scope="" excludedScope="/default-rack").
2016-08-25 18:21:19,853 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  net.NetworkTopology (NetworkTopology.java:chooseRandom(816)) - Failed to find datanode (scope="" excludedScope="/default-rack").
2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  protocol.BlockStoragePolicy (BlockStoragePolicy.java:chooseStorageTypes(161)) - Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=3, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2016-08-25 18:21:19,854 [ReplicationMonitor] WARN  blockmanagement.BlockPlacementPolicy (BlockPlacementPolicyDefault.java:chooseTarget(402)) - Failed to place enough replicas, still in need of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types are unavailable:  unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2016-08-25 18:21:19,854 [ReplicationMonitor] DEBUG BlockStateChange (BlockManager.java:computeReconstructionWorkForBlocks(1680)) - BLOCK* neededReconstruction = 1 pendingReconstruction = 0
{noformat}


*Fix:*

- {{BlockManager::addStoredBlockUnderConstruction}} should not gate the call to 
{{BlockManager::addStoredBlock}} on the block -> datanode storage mapping check 
(see the sketch below).
- {{BlockManager::addStoredBlock}} already handles the block addition, 
replacement and already-exists cases. And, more importantly, it also prunes the 
{{LowRedundancyBlocks}} list.
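
As a rough illustration of that direction (a sketch only; the authoritative 
change is in the attached HDFS-10819.001.patch), the method would drop the 
{{findStorageInfo}} guard and call {{addStoredBlock}} for every FINALIZED 
replica report, letting it handle the add/replace/already-exists cases and 
prune the low-redundancy queues:

{noformat}
  // Sketch of the idea only -- refer to HDFS-10819.001.patch for the actual change.
  void addStoredBlockUnderConstruction(StatefulBlockInfo ucBlock,
      DatanodeStorageInfo storageInfo) throws IOException {
    BlockInfo block = ucBlock.storedBlock;
    block.getUnderConstructionFeature().addReplicaIfNotPresent(
        storageInfo, ucBlock.reportedBlock, ucBlock.reportedState);

    if (ucBlock.reportedState == ReplicaState.FINALIZED) {
      // No block -> storage mapping check here: addStoredBlock() already
      // handles addition, replacement and already-exists, and prunes the
      // LowRedundancyBlocks list.
      addStoredBlock(block, ucBlock.reportedBlock, storageInfo, null, true);
    }
  }
{noformat}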

The attached patch has the fix. It also updates the unit test 
TestDataNodeHotSwapVolumes#testRemoveVolumeBeingWrittenForDatanode to expose 
the race condition, which helped reproduce the above problem frequently. With 
the proposed fix, BlockManager handles the case properly and the test passes.




> BlockManager fails to store a good block for a datanode storage after it 
> reported a corrupt block — block replication stuck
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10819
>                 URL: https://issues.apache.org/jira/browse/HDFS-10819
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-10819.001.patch
>
>
> TestDataNodeHotSwapVolumes occasionally fails in the unit test 
> testRemoveVolumeBeingWrittenForDatanode. A data write pipeline can run into 
> issues such as timeouts or an unreachable datanode; in this test case the 
> failure is induced by removing one of the volumes in a datanode while a block 
> write is in progress. Digging further into the logs, when the problem happens 
> in the write pipeline, the error recovery does not happen as expected, and 
> block replication never catches up.
> Though this problem has the same signature as HDFS-10780, from the logs it 
> looks like the code paths taken are totally different, so the root cause 
> could be different as well.



