[ 
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rushabh Shah updated HDFS-16829:
--------------------------------
    Priority: Critical  (was: Major)

> Delay deleting blocks with older generation stamp until the block is fully 
> replicated.
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-16829
>                 URL: https://issues.apache.org/jira/browse/HDFS-16829
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.10.1
>            Reporter: Rushabh Shah
>            Priority: Critical
>
> We encountered this data loss issue in one of our production clusters that 
> runs the HBase service. We received a missing-block alert for this cluster. 
> The following error was logged on the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode - 
> nodeA:51010:DataXceiver error processing READ_BLOCK operation  src: 
> /nodeA:31722 dst: <destination>
> java.io.IOException:  Offset 64410559 and length 4096 don't match block 
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
>       at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
>       at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
>       at java.lang.Thread.run(Thread.java:750)
> {noformat}
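> For context, this error comes from a sanity check when BlockSender is constructed 
> for the read: the requested offset and length are validated against the replica's 
> length on the local disk. The following is only an illustrative sketch of that kind 
> of check, using the values from the log above; it is not the exact Hadoop code and 
> the method name is made up.
> {noformat}
> // Illustrative sketch, not the real BlockSender code.
> static void checkReadRange(long offset, long length, long blockLen, String block)
>     throws java.io.IOException {
>   if (offset < 0 || length < 0 || offset + length > blockLen) {
>     throw new java.io.IOException("Offset " + offset + " and length " + length
>         + " don't match block " + block + " ( blockLen " + blockLen + " )");
>   }
> }
>
> // With the values from the log:
> //   checkReadRange(64410559, 4096, 59158528, "blk_3317546151_2244173222")
> // fails because 64410559 + 4096 = 64414655 > 59158528, i.e. the client is trying
> // to read past the end of the 59158528-byte replica that nodeA actually has on disk.
> {noformat}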
> The node +nodeA+ has this block blk_3317546151_2244173222 with an on-disk file 
> length of 59158528, but the length of this block according to the namenode (per 
> fsck) is 64414655.
> This is the sequence of events for this block.
>  
> 1. The namenode created a file with 3 replicas with block id blk_3317546151 and 
> genstamp 2244173147. 
> 2. The first datanode in the pipeline (this physical host was also running the 
> region server process, which was the HDFS client) was restarting at the same 
> time. Unfortunately this node was sick and logged nothing in either the datanode 
> process or the regionserver process during the time of block creation.
> 3. The namenode updated the pipeline with just the first node.
> 4. The namenode logged updatePipeline success with just the 1st node nodeA, with 
> block size 64414655 and new generation stamp 2244173222.
> 5. The namenode asked nodeB and nodeC to delete the block since it had an old 
> generation stamp (a rough sketch of this decision follows the list).
> 6. All reads (client reads and data-transfer reads) from nodeA now fail with the 
> above stack trace.
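> To make step 5 concrete: once updatePipeline bumped the generation stamp to 
> 2244173222, the replicas on nodeB and nodeC (still at genstamp 2244173147) were 
> considered stale and were queued for deletion almost immediately. The snippet below 
> is a simplified, hypothetical sketch of that decision, not the actual BlockManager 
> code; names and signatures are made up.
> {noformat}
> // Simplified, hypothetical sketch of the stale-genstamp handling in step 5.
> void processReportedGenStamp(long reportedGenStamp, long storedGenStamp,
>                              String blockId, String datanode) {
>   // nodeB/nodeC still report the old genstamp 2244173147 while the namenode
>   // now stores 2244173222 ...
>   if (reportedGenStamp < storedGenStamp) {
>     // ... so their copies are queued for deletion right away, even though the
>     // only replica with the new genstamp (nodeA) may itself be bad.
>     queueForDeletion(blockId, datanode);  // hypothetical helper
>   }
> }
> {noformat}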
> See the logs below from the namenode, nodeB, and nodeC.
> {noformat}
> ---------------- Logs from namenode  -----------------
> 2022-10-23 12:36:34,449 INFO  [on default port 8020] hdfs.StateChange - 
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010 
> , nodeC:51010 for <file>
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO  [on default port 8020] namenode.FSNamesystem - 
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222, 
> newLength=64414655, newNodes=[nodeA:51010], 
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO  [on default port 8020] hdfs.StateChange - DIR* 
> completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> -------------  Logs from nodeB -------------
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO  [0.180.160.231:51010]] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO  [/data-2/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>  
> {noformat}
> -------------  Logs from nodeC -------------
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655 
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO  [ype=LAST_IN_PIPELINE] datanode.DataNode - 
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147, 
> type=LAST_IN_PIPELINE terminating
> 2022-10-23 12:36:38,891 INFO  [/data-1/hdfs/current] 
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029 
> blk_3317546151_2244173147 file 
> /data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
> The following is my observation, but I can't confirm anything since I don't have 
> logs from the first node in the pipeline (nodeA) or from the HDFS client.
> Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the 
> last ~5 MB (64414655 - 59158528 = 5256127 bytes) of data to disk. I assume that 
> data was buffered in memory, since the namenode got an acknowledgement from the 
> client that updatePipeline succeeded. 
> *Improvement:*
> One suggested improvement is to delay deleting replicas with an old generation 
> stamp until the block is sufficiently replicated, i.e. until the number of live 
> replicas with the latest generation stamp satisfies the replication factor.
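> A minimal sketch of what such a guard could look like on the namenode side, 
> assuming hypothetical helper names (this is not a patch against BlockManager):
> {noformat}
> // Minimal sketch of the suggested guard, with hypothetical names.
> void maybeInvalidateStaleReplica(long reportedGenStamp, long storedGenStamp,
>                                  int liveReplicasWithLatestGenStamp,
>                                  int replicationFactor,
>                                  String blockId, String datanode) {
>   if (reportedGenStamp >= storedGenStamp) {
>     return;  // replica is up to date, nothing to delete
>   }
>   if (liveReplicasWithLatestGenStamp >= replicationFactor) {
>     queueForDeletion(blockId, datanode);      // safe to drop the stale copy
>   } else {
>     postponeInvalidation(blockId, datanode);  // hypothetical helper: re-check
>                                               // later, e.g. on the next block report
>   }
> }
> {noformat}
> In the scenario above, nodeB and nodeC would then have kept their full-length 
> copies (size 64414655, old genstamp) until the block with the new genstamp was 
> sufficiently replicated, instead of deleting them within seconds of updatePipeline.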



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
