[ 
https://issues.apache.org/jira/browse/HDFS-17649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895745#comment-17895745
 ] 

Kevin Wikant commented on HDFS-17649:
-------------------------------------

Updating my understanding here based on additional testing. See HDFS-17658 for 
more details on that testing.

Important amendments:
 * An HDFS block that is open for write cannot be moved to another datanode as 
part of decommissioning. Instead, the DatanodeAdminManager has logic intended 
to prevent decommissioning of a datanode whose blocks cannot be replicated to 
other datanodes because they are open
 * If the DataStreamer (i.e. DFSOutputStream) fails, then data already appended 
to the stream can be lost & the client may need to replay that data, especially 
when HDFS decommissioning is involved. Interestingly, the block is not 
considered committed/finalized after an hflush on the DFSOutputStream; the 
Namenode does not consider the block committed/finalized until the 
DFSOutputStream is closed (there is also an hsync method whose behaviour I have 
not yet tested).
 * Therefore, one way to think of the problem is as follows:

 ** If the DFSOutputStream fails before being closed, then the client is 
expected to replay all the data sent to the stream (even if there were hflush 
operations). I am not sure whether this is intentional behaviour in the context 
of HDFS decommissioning, but for now it may be acceptable for clients to design 
around this limitation.
 ** If the DFSOutputStream has been closed, then HDFS should not lose the data 
which was already committed/finalized. This behaviour is not currently 
satisfied by HDFS, which is covered in this new JIRA: HDFS-17658

> Improve HDFS DataStreamer client to handle datanode decommissioning
> -------------------------------------------------------------------
>
>                 Key: HDFS-17649
>                 URL: https://issues.apache.org/jira/browse/HDFS-17649
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.4.0
>         Environment: Tested on Hadoop 3.4.0
> I think the limitation still exists on the trunk though
>            Reporter: Kevin Wikant
>            Priority: Major
>
> The HDFS DataStreamer client can handle single-datanode failures by 
> failing over to other datanodes in the block write pipeline.
> However, if "dfs.replication=1" & the one datanode in the block write 
> pipeline is decommissioned, then the HDFS DataStreamer client will not 
> failover to the new datanode holding the block replica.
> If "dfs.replication>1" then the decommissioned datanode(s) will be removed 
> from the block write pipeline & new replacement datanode(s) will be requested 
> from the Namenode.
> However, if "dfs.replication=1" then a new replacement datanode will never be 
> requested from the Namenode. This is counter-intuitive because the block was 
> successfully replicated to another datanode as part of decommissioning & that 
> datanode could be returned by the Namenode to enable the DataStreamer client 
> to continue appending successfully.
> Relevant code:
>  * 
> [https://github.com/apache/hadoop/blob/7a7b346b0ab60de792ca90dede9ff369fb50d63a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1723]
>  * 
> [https://github.com/apache/hadoop/blob/7a7b346b0ab60de792ca90dede9ff369fb50d63a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1645]
>  * 
> [https://github.com/apache/hadoop/blob/7a7b346b0ab60de792ca90dede9ff369fb50d63a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DataStreamer.java#L1648]
>  
> Repro Steps:
>  # Create an HDFS cluster with "dfs.replication=1"
>  # Create a DataStreamer client & write a file to HDFS
>  # Identify what datanode the block was written to
>  # Decommission that datanode & confirm the block was replicated to another 
> datanode where it is still accessible
>  # Attempt to append with the existing DataStreamer client again & observe 
> that it always fails with:
> {quote}All datanodes [DatanodeInfoWithStorage[XYZ]] are bad
> {quote}
> Suggestion:
>  * It seems to me the DataStreamer client assumes a block is always lost 
> when, in a block write pipeline of length 1, the only block replica the 
> client is aware of goes "bad"
>  * However, this assumption does not hold when the datanode is gracefully 
> decommissioned & the block is replicated to another datanode by the Namenode. 
> In this case, the client is not aware this replication occurred when it 
> detects the datanode went "bad"
>  * I think the DataStreamer client could be updated to request a replacement 
> datanode even when "dfs.replication=1". The client can rely on the Namenode 
> as the source of truth to determine whether a block replica is still 
> available somewhere (with the required generation stamp, etc...)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
