[
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135675#comment-17135675
]
Xiaoqiao He commented on HDFS-15175:
------------------------------------
Add [~wanchang] as contributor and assign this issue to him.
> Multiple CloseOp shared block instance causes the standby namenode to crash
> when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
>
>
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774,
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845],
> permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=,
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE,
> txid=32625024993]
> java.io.IOException: File is not under construction: ......
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
> at
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
> at
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
> at
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> <RECORD>
> <OPCODE>OP_REASSIGN_LEASE</OPCODE>
> <DATA>
> <TXID>32625021150</TXID>
> <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
> <PATH>......</PATH>
> <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
> </DATA>
> </RECORD>
> ......
> <RECORD>
> <OPCODE>OP_CLOSE</OPCODE>
> <DATA>
> <TXID>32625023743</TXID>
> <LENGTH>0</LENGTH>
> <INODEID>0</INODEID>
> <PATH>......</PATH>
> <REPLICATION>3</REPLICATION>
> <MTIME>1581816135883</MTIME>
> <ATIME>1581814760398</ATIME>
> <BLOCKSIZE>536870912</BLOCKSIZE>
> <CLIENT_NAME></CLIENT_NAME>
> <CLIENT_MACHINE></CLIENT_MACHINE>
> <OVERWRITE>false</OVERWRITE>
> <BLOCK>
> <BLOCK_ID>5568434562</BLOCK_ID>
> <NUM_BYTES>185818644</NUM_BYTES>
> <GENSTAMP>4495417845</GENSTAMP>
> </BLOCK>
> <PERMISSION_STATUS>
> <USERNAME>da_music</USERNAME>
> <GROUPNAME>hdfs</GROUPNAME>
> <MODE>416</MODE>
> </PERMISSION_STATUS>
> </DATA>
> </RECORD>
> ......
> <RECORD>
> <OPCODE>OP_TRUNCATE</OPCODE>
> <DATA>
> <TXID>32625024049</TXID>
> <SRC>......</SRC>
> <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
> <CLIENTMACHINE>......</CLIENTMACHINE>
> <NEWLENGTH>185818644</NEWLENGTH>
> <TIMESTAMP>1581816136336</TIMESTAMP>
> <BLOCK>
> <BLOCK_ID>5568434562</BLOCK_ID>
> <NUM_BYTES>185818648</NUM_BYTES>
> <GENSTAMP>4495417845</GENSTAMP>
> </BLOCK>
> </DATA>
> </RECORD>
> ......
> <RECORD>
> <OPCODE>OP_CLOSE</OPCODE>
> <DATA>
> <TXID>32625024993</TXID>
> <LENGTH>0</LENGTH>
> <INODEID>0</INODEID>
> <PATH>......</PATH>
> <REPLICATION>3</REPLICATION>
> <MTIME>1581816138774</MTIME>
> <ATIME>1581814760398</ATIME>
> <BLOCKSIZE>536870912</BLOCKSIZE>
> <CLIENT_NAME></CLIENT_NAME>
> <CLIENT_MACHINE></CLIENT_MACHINE>
> <OVERWRITE>false</OVERWRITE>
> <BLOCK>
> <BLOCK_ID>5568434562</BLOCK_ID>
> <NUM_BYTES>185818644</NUM_BYTES>
> <GENSTAMP>4495417845</GENSTAMP>
> </BLOCK>
> <PERMISSION_STATUS>
> <USERNAME>da_music</USERNAME>
> <GROUPNAME>hdfs</GROUPNAME>
> <MODE>416</MODE>
> </PERMISSION_STATUS>
> </DATA>
> </RECORD>
> {panel}
>
>
> The block size should be 185818648 in the first CloseOp. When truncate is
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is
> synchronized to the JournalNode in the same batch. The block used by CloseOp
> twice is the same instance, which causes the first CloseOp has wrong block
> size. When SNN rolling Editlog, TruncateOp does not make the file to the
> UnderConstruction state. Then, when the second CloseOp is executed, the file
> is not in the UnderConstruction state, and SNN crashes.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]