[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

Ayush Saxena (Jira) Tue, 01 Nov 2022 11:27:06 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627288#comment-17627288
 ]


Ayush Saxena commented on HDFS-16829:
-------------------------------------

{quote}One suggested improvement is to delay deleting the blocks with old 
generation stamp until the block is replicated sufficiently which satisfies 
replication factor.
{quote}
What if the client still pushed more data to the single datanode? The data 
contained by the other datanodes won't be complete right? You can get something 
if you figure out really quick before the configured time after which the block 
gets deleted. Moreover with this in genuine cases also you would be occupying 
the memory with blocks with old genstamps, if the cluster is really unstable 
and has a lot of updatePipelines or things like that, it may be a issue in that 
case
{quote}3. Namenode updated the pipeline just with the first node.
{quote}
I think there is a minimum configuration as well, say if you configure that to 
two, you won't be able to succeed if it doesn't have minimum 2 datanodes in the 
pipeline. It can prevent you with such situation where one node is screwed. Not 
sure but I need to check if it is: ``

dfs.client.block.write.replace-datanode-on-failure.min-replication``, if it 
doesn't behave that way, good to have a new config with such behaviour 
semantics....

 
{quote}Due to disk write errors or bug in BlockReceiver, nodeA didn't 
flush/sync the last 5MB (64414655-59158528) of data to disk. I assume it 
buffered in memory since nameonode got an acknowledgement from the client that 
updatePipeline succeeded.
{quote}
I think if there is one node only, we can give it to try that we have syncBlock 
always true in such cases, might help or  may be not if the Disk has some 
issues in some cases...

 

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-16829
>                 URL: https://issues.apache.org/jira/browse/HDFS-16829
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.10.1
>            Reporter: Rushabh Shah
>            Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which 
> runs hbase service. We received a missing block alert in this cluster. This 
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: <destination>
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>       at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with file length: 
> 59158528 but the length of this block according to namenode is 64414655 
> (according to fsck)
> This are the sequence of events for this block.
>  
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and 
> genstamp: 2244173147. 
> 2. The first datanode in the pipeline (This physical host was also running 
> region server process which was hdfs client) was restarting at the same time. 
> Unfortunately this node was sick and it didn't log anything neither in 
> datanode process or regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block 
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it has old 
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are 
> failing with the above stack trace.
> See logs below from namenode and nodeB and nodeC.
> {noformat}
> ---------------- Logs from namenode  -----------------
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for <file>
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -------------  Logs from nodeB -------------
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
> -------------  Logs from nodeC -------------
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=LAST_IN_PIPELINE terminating
> 2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
> My observation but can't confirm anything since I don't have logs from first 
> node in pipeline (nodeA) or hdfs client.
> Due to disk write errors or bug in BlockReceiver, nodeA didn't flush/sync the 
> last 5MB (64414655-59158528) of data to disk. I assume it buffered in memory 
> since nameonode got an acknowledgement from the client that updatePipeline 
> succeeded. 
> *Improvement:*
> One suggested improvement is to delay deleting the blocks with old generation 
> stamp until the block is replicated sufficiently which satisfies replication 
> factor.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-16829) Delay deleting blocks with older generation stamp until the block is fully replicated.

Reply via email to