KevinWikant opened a new pull request, #7179:
URL: https://github.com/apache/hadoop/pull/7179
## Problem
Problem background:
- A datanode should only enter decommissioned state if all the blocks on the
datanode are sufficiently replicated to other live datanodes.
- This expectation is violated for Under Construction blocks, which are not
considered by the DatanodeAdminMonitor at all.
- DatanodeAdminMonitor currently only considers blocks in the
DatanodeDescriptor StorageInfos, which exclude Under Construction blocks (see
the client-side sketch after this list):
- For a newly created HDFS block, it is not added to the StorageInfos until
the HDFS client closes the DFSOutputStream & the block becomes finalized
- For an existing HDFS block that was opened for append:
- First, the block version with the previous generation stamp is marked
stale & removed from the StorageInfos
- Next, the block version with the new generation stamp is not added to
the StorageInfos until the HDFS client closes the DFSOutputStream & the
block becomes finalized
There is logic in the DatanodeAdminManager/DatanodeAdminMonitor to avoid
transitioning datanodes to decommissioned state when they have open (i.e. Under
Construction) blocks:
-
https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L357
-
https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L305
This logic does not work correctly because, as mentioned above,
[DatanodeAdminMonitor currently only considers blocks in the DatanodeDescriptor
StorageInfos](https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L385),
and the StorageInfos do not include Under Construction blocks whose
DFSOutputStream has not yet been closed.
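For illustration, the shape of that scan is roughly the following (the types
here are simplified stand-ins for the real Hadoop classes, not the actual
DatanodeAdminDefaultMonitor code):

```java
import java.util.List;

class StorageInfoScanSketch {
  // Illustrative stand-ins for the real blockmanagement types.
  interface Block {}
  interface StorageInfo { List<Block> getBlocks(); }
  interface Datanode { List<StorageInfo> getStorageInfos(); }

  /**
   * Only blocks already present in the per-storage block lists are
   * visited, so an Under Construction block that has not yet been
   * finalized is never checked at all.
   */
  static boolean allVisibleBlocksSafe(Datanode dn) {
    for (StorageInfo storage : dn.getStorageInfos()) {
      for (Block block : storage.getBlocks()) {
        if (!isSufficientlyReplicated(block)) {
          // Block still needs replicas; decommissioning must wait.
          return false;
        }
      }
    }
    // A datanode whose only blocks are Under Construction falls through
    // to here and (incorrectly) looks safe to decommission.
    return true;
  }

  // Stand-in for the real replication check done via the BlockManager.
  private static boolean isSufficientlyReplicated(Block block) {
    return true; // placeholder
  }
}
```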
There is also logic in the HDFS [DataStreamer client which will replace
bad/dead datanodes in the block write
pipeline](https://github.com/apache/hadoop/blob/cd2cffe73f909a106ba47653acf525220f2665cf/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1716).
Note that:
- this logic cannot work if the replication factor is 1, because there is no
surviving replica to transfer the block from
- if the replication factor is greater than 1, this logic still fails if all
the datanodes in the block write pipeline are decommissioned/terminated at
around the same time (the relevant client-side settings are sketched after
this list)
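For context, this pipeline-recovery behavior is governed by client-side
settings. A brief sketch of the relevant keys follows (the key names come from
the HDFS client configuration; the comments reflect my understanding of the
policies, not text from this patch):

```java
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoverySettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Enable replacement of bad/dead datanodes in the write pipeline
    // (this is already the default).
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.enable", true);
    // DEFAULT only adds a replacement datanode under certain conditions;
    // ALWAYS is stricter; NEVER disables replacement entirely.
    conf.set(
        "dfs.client.block.write.replace-datanode-on-failure.policy",
        "DEFAULT");
    // Regardless of these settings, with replication factor 1 there is
    // no surviving replica to transfer the block from, so pipeline
    // recovery cannot succeed once the sole datanode is gone.
  }
}
```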
Overall, the Namenode should not be putting datanodes with open blocks into
decommissioned state & hoping that the DataStreamer client is able to replace
them when the decommissioned datanodes are terminated. Whether this works
depends on timing & therefore it is not a solution which guarantees
correctness.
The Namenode needs to honor the rule that "a datanode should only enter
decommissioned state if all the blocks on the datanode are sufficiently
replicated to other live datanodes", even for blocks which are currently Under
Construction.
## Potential Solutions
One possible opinion is that if the DFSOutputStream has not been successfully
closed yet, then the client should be able to replay all the data if there is
a failure. The client should not have any expectation that the data is
committed to HDFS until the DFSOutputStream is closed. There are a few reasons
I do not think this makes sense:
- The methods hflush/hsync do not result in the data already appended to the
DFSOutputStream being persisted/finalized. This is confusing when compared to
the standard semantics of stream flush/sync methods.
- This does not handle the case where a block is re-opened by a new
DFSOutputStream after having been previously closed by a different client. In
this case, the problem leads to data loss for data that was previously
committed by the other client & cannot be replayed by the new one.
- To solve this problem, we could try not removing the old block version from
the StorageInfos when a new block version is created; however, this change is
likely to have wider implications on block management.
Another possible option is to add blocks to the StorageInfos before they are
finalized. However, this change is also likely to have wider implications on
block management.
Without modifying any existing block management logic, we can add a new data
structure (UnderConstructionBlocks) which temporarily tracks the Under
Construction blocks in-memory until they are committed/finalized & added to the
StorageInfos.
## Solution
Add a new data structure (UnderConstructionBlocks), sketched below, which
temporarily tracks the Under Construction blocks in-memory until they are
committed/finalized & added to the StorageInfos.
Pros:
- works for newly created HDFS blocks
- works for re-opened HDFS blocks (i.e. opened for append)
- works for blocks with any replication factor
- does not change existing logic in the BlockManager; the new
UnderConstructionBlocks data structure & its associated logic are purely
additive
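As a rough illustration of the idea (the method names and the keying by
datanode UUID here are hypothetical, not necessarily the exact API in this
patch), the structure could look something like:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Sketch of an in-memory tracker for Under Construction blocks. Entries
 * live only from the time a write pipeline opens a block until the block
 * is committed/finalized & appears in the StorageInfos.
 */
public class UnderConstructionBlocks {
  // Keyed by datanode UUID; values are the IDs of blocks currently open
  // for write on that datanode.
  private final Map<String, Set<Long>> ucBlocksByDatanode = new HashMap<>();

  public synchronized void addUcBlock(String datanodeUuid, long blockId) {
    ucBlocksByDatanode
        .computeIfAbsent(datanodeUuid, k -> new HashSet<>())
        .add(blockId);
  }

  public synchronized void removeUcBlock(String datanodeUuid, long blockId) {
    Set<Long> blocks = ucBlocksByDatanode.get(datanodeUuid);
    if (blocks != null) {
      blocks.remove(blockId);
      if (blocks.isEmpty()) {
        ucBlocksByDatanode.remove(datanodeUuid);
      }
    }
  }

  /**
   * The DatanodeAdminMonitor would consult this before transitioning a
   * datanode from DECOMMISSION_INPROGRESS to DECOMMISSIONED.
   */
  public synchronized boolean hasUnderConstructionBlocks(String datanodeUuid) {
    Set<Long> blocks = ucBlocksByDatanode.get(datanodeUuid);
    return blocks != null && !blocks.isEmpty();
  }
}
```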
### How was this patch tested?
`TODO - will add detailed test results`
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [X] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [n/a] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]