[
https://issues.apache.org/jira/browse/HDFS-16829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rushabh Shah updated HDFS-16829:
--------------------------------
Priority: Critical (was: Major)
> Delay deleting blocks with older generation stamp until the block is fully
> replicated.
> --------------------------------------------------------------------------------------
>
> Key: HDFS-16829
> URL: https://issues.apache.org/jira/browse/HDFS-16829
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 2.10.1
> Reporter: Rushabh Shah
> Priority: Critical
>
> We encountered this data loss issue in one of our production clusters which
> runs hbase service. We received a missing block alert in this cluster. This
> error was logged in the datanode holding the block.
> {noformat}
> 2022-10-27 18:37:51,341 ERROR [17546151_2244173222]] datanode.DataNode -
> nodeA:51010:DataXceiver error processing READ_BLOCK operation src:
> /nodeA:31722 dst: <destination>
> java.io.IOException: Offset 64410559 and length 4096 don't match block
> BP-958889176-1567030695029:blk_3317546151_2244173222 ( blockLen 59158528 )
> at
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:384)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:603)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:145)
> at
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:100)
> at
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:298)
> at java.lang.Thread.run(Thread.java:750)
> {noformat}
> The node +nodeA+ has this block blk_3317546151_2244173222 with an on-disk file
> length of 59158528, but according to the namenode (via fsck) the length of this
> block is 64414655.
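> The stack trace above corresponds to a simple bounds check: the requested read
> range extends past the on-disk replica length. A minimal sketch of that
> arithmetic (illustrative only, not the actual BlockSender code):

```java
import java.io.IOException;

public class ReadRangeCheck {
    // Sketch of the bounds check that produces the error above; this is
    // illustrative arithmetic, not the real HDFS BlockSender implementation.
    static void checkReadRange(long offset, long length, long blockLen)
            throws IOException {
        if (offset < 0 || length < 0 || offset + length > blockLen) {
            throw new IOException("Offset " + offset + " and length " + length
                + " don't match block ( blockLen " + blockLen + " )");
        }
    }

    public static void main(String[] args) {
        // The failing read from the log: 64410559 + 4096 = 64414655, which is
        // the namenode's recorded length, but the replica on disk is only
        // 59158528 bytes long, so every such read is rejected.
        try {
            checkReadRange(64410559L, 4096L, 59158528L);
        } catch (IOException expected) {
            System.out.println("read rejected: " + expected.getMessage());
        }
    }
}
```

> Note that 64410559 + 4096 = 64414655, exactly the namenode's length for the
> block, so the client was reading up to the end of the file as the namenode
> sees it.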
> This is the sequence of events for this block:
>
> 1. Namenode created a file with 3 replicas with block id: blk_3317546151 and
> genstamp: 2244173147.
> 2. The first datanode in the pipeline (this physical host was also running the
> region server process, which was the hdfs client) was restarting at the same
> time. Unfortunately this node was unhealthy and logged nothing in either the
> datanode process or the regionserver process during the time of block creation.
> 3. Namenode updated the pipeline just with the first node.
> 4. Namenode logged updatePipeline success with just 1st node nodeA with block
> size: 64414655 and new generation stamp: 2244173222
> 5. Namenode asked nodeB and nodeC to delete the block since it had an old
> generation stamp.
> 6. All the reads (client reads and data transfer reads) from nodeA are
> failing with the above stack trace.
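> The invalidation in step 5 follows from a generation-stamp comparison: once
> updatePipeline recorded the new genstamp 2244173222, the replicas on nodeB and
> nodeC (still at 2244173147) appear stale. A small illustrative check
> (hypothetical names, not actual NameNode code):

```java
public class GenStampCheck {
    // Illustrative only: a replica whose generation stamp is older than the
    // namenode's current genstamp for the block is treated as stale and
    // scheduled for deletion. The method name here is hypothetical.
    static boolean isStaleReplica(long replicaGenStamp, long currentGenStamp) {
        return replicaGenStamp < currentGenStamp;
    }

    public static void main(String[] args) {
        long oldGS = 2244173147L; // genstamp still held by nodeB and nodeC
        long newGS = 2244173222L; // genstamp recorded by updatePipeline for nodeA
        // nodeB/nodeC replicas look stale and get invalidated; nodeA's does not.
        System.out.println("nodeB/nodeC stale: " + isStaleReplica(oldGS, newGS));
        System.out.println("nodeA stale: " + isStaleReplica(newGS, newGS));
    }
}
```

> This is exactly why the only surviving full-length copies (on nodeB and nodeC)
> were deleted seconds after the pipeline update, as the logs below show.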
> See logs below from the namenode, nodeB, and nodeC.
> {noformat}
> ---------------- Logs from namenode -----------------
> 2022-10-23 12:36:34,449 INFO [on default port 8020] hdfs.StateChange -
> BLOCK* allocate blk_3317546151_2244173147, replicas=nodeA:51010, nodeB:51010
> , nodeC:51010 for <file>
> 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem -
> updatePipeline(blk_3317546151_2244173147 => blk_3317546151_2244173222) success
> 2022-10-23 12:36:34,978 INFO [on default port 8020] namenode.FSNamesystem -
> updatePipeline(blk_3317546151_2244173147, newGS=2244173222,
> newLength=64414655, newNodes=[nodeA:51010],
> client=DFSClient_NONMAPREDUCE_1038417265_1)
> 2022-10-23 12:36:35,004 INFO [on default port 8020] hdfs.StateChange - DIR*
> completeFile: <file> is closed by DFSClient_NONMAPREDUCE_1038417265_1
> {noformat}
> {noformat}
> ------------- Logs from nodeB -------------
> 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode -
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655
> from nodeA:30302
> 2022-10-23 12:36:35,084 INFO [0.180.160.231:51010]] datanode.DataNode -
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147,
> type=HAS_DOWNSTREAM_IN_PIPELINE, downstreams=1:[nodeC:51010] terminating
> 2022-10-23 12:36:39,738 INFO [/data-2/hdfs/current]
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029
> blk_3317546151_2244173147 file
> /data-2/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
>
> {noformat}
> ------------- Logs from nodeC -------------
> 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode -
> Received BP-958889176-1567030695029:blk_3317546151_2244173147 size 64414655
> from nodeB:56486
> 2022-10-23 12:36:34,985 INFO [ype=LAST_IN_PIPELINE] datanode.DataNode -
> PacketResponder: BP-958889176-1567030695029:blk_3317546151_2244173147,
> type=LAST_IN_PIPELINE terminating
> 2022-10-23 12:36:38,891 INFO [/data-1/hdfs/current]
> impl.FsDatasetAsyncDiskService - Deleted BP-958889176-1567030695029
> blk_3317546151_2244173147 file
> /data-1/hdfs/current/BP-958889176-1567030695029/current/finalized/subdir189/subdir188/blk_3317546151
> {noformat}
> My observation (which I can't confirm, since I don't have logs from the first
> node in the pipeline (nodeA) or from the hdfs client):
> Due to disk write errors or a bug in BlockReceiver, nodeA didn't flush/sync the
> last ~5 MB (64414655 - 59158528 bytes) of data to disk. I assume the data was
> buffered in memory, since the namenode got an acknowledgement from the client
> that updatePipeline succeeded.
> *Improvement:*
> One suggested improvement is to delay deleting blocks with an old generation
> stamp until the block with the new generation stamp is replicated sufficiently
> to satisfy the replication factor.
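> The suggested improvement can be sketched as a guard before invalidation:
> only delete an old-genstamp replica once the new-genstamp block has enough
> live replicas. All names below are hypothetical, not actual NameNode APIs.

```java
public class DeferredInvalidation {
    // Hypothetical sketch of the suggested fix: keep stale (old-genstamp)
    // replicas until the new-genstamp block meets the replication factor,
    // so a single corrupt/short replica cannot become the only copy.
    static boolean safeToInvalidateStaleReplica(int liveReplicasAtNewGenStamp,
                                                int replicationFactor) {
        return liveReplicasAtNewGenStamp >= replicationFactor;
    }

    public static void main(String[] args) {
        // In this incident the new-genstamp block lived only on nodeA
        // (1 live replica, replication factor 3), so under this policy the
        // full-length copies on nodeB and nodeC would have been kept.
        System.out.println("delete now: " + safeToInvalidateStaleReplica(1, 3));
        System.out.println("delete now: " + safeToInvalidateStaleReplica(3, 3));
    }
}
```

> Under this policy the deletions on nodeB and nodeC would have been deferred
> until nodeA's replica had been re-replicated, preserving a recoverable copy
> of the missing 5 MB.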
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]