[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17389821#comment-17389821
 ] 

Stephen O'Donnell commented on HDFS-15175:
------------------------------------------

Thanks for committing it. It saved me some work today!

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-15175
>                 URL: https://issues.apache.org/jira/browse/HDFS-15175
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Yicong Cai
>            Assignee: Wan Chang
>            Priority: Critical
>              Labels: NameNode
>             Fix For: 3.4.0, 3.2.3, 3.3.2, 3.2.4
>
>         Attachments: HDFS-15175-trunk.1.patch
>
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ......
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> <RECORD>
>  <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>  <DATA>
>  <TXID>32625021150</TXID>
>  <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>  <PATH>......</PATH>
>  <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_CLOSE</OPCODE>
>  <DATA>
>  <TXID>32625023743</TXID>
>  <LENGTH>0</LENGTH>
>  <INODEID>0</INODEID>
>  <PATH>......</PATH>
>  <REPLICATION>3</REPLICATION>
>  <MTIME>1581816135883</MTIME>
>  <ATIME>1581814760398</ATIME>
>  <BLOCKSIZE>536870912</BLOCKSIZE>
>  <CLIENT_NAME></CLIENT_NAME>
>  <CLIENT_MACHINE></CLIENT_MACHINE>
>  <OVERWRITE>false</OVERWRITE>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818644</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  <PERMISSION_STATUS>
>  <USERNAME>da_music</USERNAME>
>  <GROUPNAME>hdfs</GROUPNAME>
>  <MODE>416</MODE>
>  </PERMISSION_STATUS>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_TRUNCATE</OPCODE>
>  <DATA>
>  <TXID>32625024049</TXID>
>  <SRC>......</SRC>
>  <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>  <CLIENTMACHINE>......</CLIENTMACHINE>
>  <NEWLENGTH>185818644</NEWLENGTH>
>  <TIMESTAMP>1581816136336</TIMESTAMP>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818648</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  </DATA>
>  </RECORD>
> ......
> <RECORD>
>  <OPCODE>OP_CLOSE</OPCODE>
>  <DATA>
>  <TXID>32625024993</TXID>
>  <LENGTH>0</LENGTH>
>  <INODEID>0</INODEID>
>  <PATH>......</PATH>
>  <REPLICATION>3</REPLICATION>
>  <MTIME>1581816138774</MTIME>
>  <ATIME>1581814760398</ATIME>
>  <BLOCKSIZE>536870912</BLOCKSIZE>
>  <CLIENT_NAME></CLIENT_NAME>
>  <CLIENT_MACHINE></CLIENT_MACHINE>
>  <OVERWRITE>false</OVERWRITE>
>  <BLOCK>
>  <BLOCK_ID>5568434562</BLOCK_ID>
>  <NUM_BYTES>185818644</NUM_BYTES>
>  <GENSTAMP>4495417845</GENSTAMP>
>  </BLOCK>
>  <PERMISSION_STATUS>
>  <USERNAME>da_music</USERNAME>
>  <GROUPNAME>hdfs</GROUPNAME>
>  <MODE>416</MODE>
>  </PERMISSION_STATUS>
>  </DATA>
>  </RECORD>
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is 
> synchronized to the JournalNode in the same batch. The block used by CloseOp 
> twice is the same instance, which causes the first CloseOp has wrong block 
> size. When SNN rolling Editlog, TruncateOp does not make the file to the 
> UnderConstruction state. Then, when the second CloseOp is executed, the file 
> is not in the UnderConstruction state, and SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to