[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

Rushabh Shah (Jira) Tue, 01 Nov 2022 11:52:09 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627294#comment-17627294
 ]


Rushabh Shah commented on HDFS-16829:
-------------------------------------

Thank you [~ayushsaxena] for the response.

> What if the client still pushed more data to the single datanode? The data 
> contained by the other datanodes won't be complete right? 
The proposal is to keep the blocks on disk so some operator can manually 
restore them. We don't want to make the replica with old genstamp primary 
replica automatically.

> Moreover with this in genuine cases also you would be occupying the memory 
> with blocks with old genstamps, if the cluster is really unstable and has a 
> lot of updatePipelines or things like that, it may be a issue in that case
IMHO this shouldn't be an issue. We have ReplicationMonitor thread which works 
pretty well. Even in case of many update pipelines, ReplicationMonitor is fast 
enough to replicate the blocks to maintain replication factor. The proposal is 
just to wait deleting the older genstamp blocks until ReplicationMonitor 
replicates that block. If you are concerned about this behavior then we can 
introduce a config key which will default to current behavior and if enabled it 
will delay until it is fully replicated.

> You can get something if you figure out really quick before the configured 
> time after which the block gets deleted. 
In this case, the replica with old genstamp was deleted within 3 seconds after 
the file was closed. One needs to be super human to catch this in production. :)

> dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it 
> doesn't behave that way, good to have a new config with such behaviour 
> semantics....
Thank you for pointing out this config. Let me check this out.

> I think if there is one node only, we can give it to try that we have 
>syncBlock always true in such cases, might help or  may be not if the Disk has 
>some issues in some cases...

Given that I don't have logs either from datanode or hdfs client, I am not able 
to understand what exactly happened? :(

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-16829
>                 URL: https://issues.apache.org/jira/browse/HDFS-16829
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.10.1
>            Reporter: Rushabh Shah
>            Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: <destination>
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>       at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
> ---------------- Logs from namenode  -----------------
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for <file>
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -------------  Logs from nodeB -------------
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
> -------------  Logs from nodeC -------------
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=LAST_IN_PIPELINE terminating
> 2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
> My observation but can't confirm anything since I don't have logs from first 
> node in pipeline (nodeA) or hdfs client.
> Due to disk write errors or bug in BlockReceiver, nodeA didn't flush/sync the 
> last 5MB (64414655-59158528) of data to disk. I assume it buffered in memory 
> since nameonode got an acknowledgement from the client that updatePipeline 
> succeeded. 
> *Improvement:*
> One suggested improvement is to delay deleting the blocks with old generation 
> stamp until the block is replicated sufficiently which satisfies replication 
> factor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

Reply via email to