Tao Yang created HDFS-15105:
-------------------------------
Summary: Standby NN exits and fails to restart due to edit log
corruption
Key: HDFS-15105
URL: https://issues.apache.org/jira/browse/HDFS-15105
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Tao Yang
We found a issue that Standby NN exited and failed to restart until we resolved
the edit log corruption.
Error logs:
{noformat}
java.io.IOException: Mismatched block IDs or generation stamps, attempting to
replace block blk_74288647857_73526148211 with blk_74288647857_73526377369 as
block # 15/17 of /maindump/mainv10/dump_online/lasttable/20200105015500/part-319
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1019)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:431)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:234)
at
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:885)
at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:866)
at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:234)
at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:342)
at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:295)
at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:312)
at
org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455)
at
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:308)
{noformat}
Related edit log transactions of the same file:
{noformat}
1. TXID=444341628498 time=1578251449632
OP_UPDATE_BLOCKS
blocks: ... blk_74288647857_73526148211 blk_74454090866_73526215536
2. TXID=444342382774 time=1578251520740
OP_REASSIGN_LEASE
3. TXID=444342401216 time=1578251522779
OP_CLOSE
blocks: ... blk_74288647857_73526377369 blk_74454090866_73526374095
4. TXID=444342401394
OP_SET_GENSTAMP_V2
generate stamp: 73526377369 !!!! this stamp is generated but already used in
the previous edit log
5. TXID=444342401395 time=1578251522835 (03:12:02,835)
OP_TRUNCATE
6. TXID=444342402176 time=1578251523246 (03:12:03,246)
OP_CLOSE
blocks: ... blk_74288647857_73526377369
{noformat}
According to the edit logs, it's wield to see that stamp(73526377369) was
generated in transaction 4 but already used in transaction 3, and for
transaction 3 there should be only the last block changed but in fact the last
two blocks are both changed.
This problem might be produced in a complex scenario that truncate operation
immediately followed the recover-lease operation for the same file. A
suspicious point is that between creation and being written for transaction 3,
stamp of the second last block was updated when committing block
synchronization caused by the truncate operation.
Related calling stack is as follows:
{noformat}
NameNodeRpcServer#commitBlockSynchronization
FSNamesystem#commitBlockSynchronization
// update last block
if(!copyTruncate) {
storedBlock.setGenerationStamp(newgenerationstamp); //updated now!!!
storedBlock.setNumBytes(newlength);
}
{noformat}
Any comments are welcome. Thanks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]