[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389232#comment-17389232 ]

Xiaoqiao He edited comment on HDFS-15175 at 7/29/21, 3:30 AM:
--------------------------------------------------------------

Backported to branch-3.2 first, since 3.2.3 is pending release and we need to fix it ASAP. Please let me know if we should backport to other active branches. Thanks.

was (Author: hexiaoqiao):
Backported to branch-3.2 first, since 3.2.3 is pending release and we need to fix it ASAP. Please let me know if we should backport to other branches. Thanks.

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Fix For: 3.4.0, 3.2.3, 3.2.4
>
> Attachments: HDFS-15175-trunk.1.patch
>
>
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], permissions=da_music:hdfs:rw-r-----, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
> java.io.IOException: File is not under construction: ..
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:360)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>     at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>     at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> <RECORD>
>   <OPCODE>OP_REASSIGN_LEASE</OPCODE>
>   <DATA>
>     <TXID>32625021150</TXID>
>     <LEASEHOLDER>DFSClient_NONMAPREDUCE_-969060727_197760</LEASEHOLDER>
>     <PATH>..</PATH>
>     <NEWHOLDER>DFSClient_NONMAPREDUCE_1000868229_201260</NEWHOLDER>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625023743</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816135883</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_TRUNCATE</OPCODE>
>   <DATA>
>     <TXID>32625024049</TXID>
>     <SRC>..</SRC>
>     <CLIENTNAME>DFSClient_NONMAPREDUCE_1000868229_201260</CLIENTNAME>
>     <CLIENTMACHINE>..</CLIENTMACHINE>
>     <NEWLENGTH>185818644</NEWLENGTH>
>     <TIMESTAMP>1581816136336</TIMESTAMP>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818648</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>   </DATA>
> </RECORD>
> ..
> <RECORD>
>   <OPCODE>OP_CLOSE</OPCODE>
>   <DATA>
>     <TXID>32625024993</TXID>
>     <LENGTH>0</LENGTH>
>     <INODEID>0</INODEID>
>     <PATH>..</PATH>
>     <REPLICATION>3</REPLICATION>
>     <MTIME>1581816138774</MTIME>
>     <ATIME>1581814760398</ATIME>
>     <BLOCKSIZE>536870912</BLOCKSIZE>
>     <CLIENT_NAME></CLIENT_NAME>
>     <CLIENT_MACHINE></CLIENT_MACHINE>
>     <OVERWRITE>false</OVERWRITE>
>     <BLOCK>
>       <BLOCK_ID>5568434562</BLOCK_ID>
>       <NUM_BYTES>185818644</NUM_BYTES>
>       <GENSTAMP>4495417845</GENSTAMP>
>     </BLOCK>
>     <PERMISSION_STATUS>
>       <USERNAME>da_music</USERNAME>
>       <GROUPNAME>hdfs</GROUPNAME>
>       <MODE>416</MODE>
>     </PERMISSION_STATUS>
>   </DATA>
> </RECORD>
> {panel}
>
>
> The first CloseOp should record a block size of 185818648. After the truncate, the block size becomes 185818644.
> The CloseOp, TruncateOp, and second CloseOp are synced to the JournalNode in the same batch, and both CloseOps reference the same block instance, so the first CloseOp is written with the wrong block size.
> When the SNN replays the edit log, the TruncateOp does not put the file into the UnderConstruction state. When the second CloseOp is then applied, the file is not in the UnderConstruction state, and the SNN crashes.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
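The aliasing described in the issue can be illustrated with a minimal, self-contained Java sketch. It uses no Hadoop classes: `DemoBlock`, `DemoCloseOp`, and `SharedBlockDemo` are hypothetical names introduced here, and the deferred "serialization" loop is only an assumed stand-in for ops being written out later in one batch. It shows that when two queued ops hold a reference to the same mutable block, a size change made between them is visible in both once they are finally written.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a mutable block; NOT Hadoop's Block class. */
final class DemoBlock {
    long numBytes;

    DemoBlock(long numBytes) {
        this.numBytes = numBytes;
    }
}

/** Hypothetical queued edit that keeps a reference to the live block. */
final class DemoCloseOp {
    final long txid;
    final DemoBlock block; // shared reference, not a snapshot

    DemoCloseOp(long txid, DemoBlock block) {
        this.txid = txid;
        this.block = block;
    }

    /** Reads the block size at serialization time, not at creation time. */
    long serializedNumBytes() {
        return block.numBytes;
    }
}

final class SharedBlockDemo {
    /** Returns the block sizes written for the two close ops, in order. */
    static long[] run() {
        DemoBlock block = new DemoBlock(185818648L);
        List<DemoCloseOp> pending = new ArrayList<>();

        // First close is queued while the block still has its original size.
        pending.add(new DemoCloseOp(32625023743L, block));

        // A truncate shrinks the same block instance before the batch is written.
        block.numBytes = 185818644L;

        // Second close is queued with the same instance.
        pending.add(new DemoCloseOp(32625024993L, block));

        // The batch is serialized only now, so both ops see the truncated size.
        long[] written = new long[pending.size()];
        for (int i = 0; i < pending.size(); i++) {
            written[i] = pending.get(i).serializedNumBytes();
        }
        return written;
    }

    public static void main(String[] args) {
        long[] written = run();
        System.out.println("first close: " + written[0] + ", second close: " + written[1]);
    }
}
```

In this sketch both ops report 185818644, which mirrors the two OP_CLOSE records in the edit log above carrying the same block size.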
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389225#comment-17389225 ]

Xiaoqiao He edited comment on HDFS-15175 at 7/29/21, 3:23 AM:
--------------------------------------------------------------

Committed to trunk. Thanks [~caiyicong] for the report, thanks [~wanchang] for your contribution, and thanks to everyone for the warm discussion here. Thanks [~sodonnell] for your reviews!

was (Author: hexiaoqiao):
Committed to trunk. Thanks [~caiyicong] for the report and thanks [~wanchang] for your contribution! Thanks [~sodonnell] for your reviews!

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Fix For: 3.4.0
>
> Attachments: HDFS-15175-trunk.1.patch
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388487#comment-17388487 ]

Max Xie edited comment on HDFS-15175 at 7/28/21, 6:21 AM:
----------------------------------------------------------

[~sodonnell] Agree with you. One solution is to deep copy the op. But considering NameNode performance, we deep copy only the CloseOp's blocks, and we have merged that patch into our HDFS cluster (250+ DNs, 270 million blocks). It has run well so far.

was (Author: max2049):
[~sodonnell] Agree with you. One solution is to deep copy the op.

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
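The deep-copy idea from the comment above can be sketched as follows. This is only an illustration under stated assumptions, not the attached patch: `CopyableBlock`, `SnapshotCloseOp`, and `DeepCopyDemo` are hypothetical names, and no Hadoop API is used. The close op copies its blocks when it is created, so a later change to the live block cannot alter what the earlier op writes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical mutable block with a copy constructor; NOT Hadoop's Block class. */
final class CopyableBlock {
    private long numBytes;

    CopyableBlock(long numBytes) {
        this.numBytes = numBytes;
    }

    /** Copy constructor: snapshots the current state of another block. */
    CopyableBlock(CopyableBlock other) {
        this.numBytes = other.numBytes;
    }

    long getNumBytes() {
        return numBytes;
    }

    void setNumBytes(long numBytes) {
        this.numBytes = numBytes;
    }
}

/** Hypothetical close op that deep copies its blocks at creation time. */
final class SnapshotCloseOp {
    private final CopyableBlock[] blocks;

    SnapshotCloseOp(CopyableBlock[] liveBlocks) {
        blocks = new CopyableBlock[liveBlocks.length];
        for (int i = 0; i < liveBlocks.length; i++) {
            blocks[i] = new CopyableBlock(liveBlocks[i]); // private snapshot
        }
    }

    long lastBlockNumBytes() {
        return blocks[blocks.length - 1].getNumBytes();
    }
}

final class DeepCopyDemo {
    /** Returns the last-block sizes written for the two close ops, in order. */
    static long[] run() {
        CopyableBlock[] live = { new CopyableBlock(185818648L) };
        List<SnapshotCloseOp> pending = new ArrayList<>();

        pending.add(new SnapshotCloseOp(live)); // first close snapshots 185818648
        live[0].setNumBytes(185818644L);        // truncate changes the live block only
        pending.add(new SnapshotCloseOp(live)); // second close snapshots 185818644

        long[] written = new long[pending.size()];
        for (int i = 0; i < pending.size(); i++) {
            written[i] = pending.get(i).lastBlockNumBytes();
        }
        return written;
    }

    public static void main(String[] args) {
        long[] written = run();
        System.out.println("first close: " + written[0] + ", second close: " + written[1]);
    }
}
```

Copying only the blocks of the op, rather than the whole op, keeps the extra allocation per edit small, which matches the performance concern raised in the comment.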
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388487#comment-17388487 ]

Max Xie edited comment on HDFS-15175 at 7/28/21, 6:14 AM:
----------------------------------------------------------

[~sodonnell] Agree with you. One solution is to deep copy the op.

was (Author: max2049):
[~sodonnell] Agree with you. One solution is to deep copy the op.

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324224#comment-17324224 ]

Xiaoqiao He edited comment on HDFS-15175 at 4/17/21, 10:17 AM:
---------------------------------------------------------------

[~kihwal], [~daryn], [~ahussein] any ideas about this corner case, which could be related to an async edit logger bug? It has been reported many times by different people. [^HDFS-15175-trunk.1.patch] tries to fix it with a deep copy, which could be one option. Would you mind reviewing it?

was (Author: hexiaoqiao):
[~kihwal], [~daryn], [~ahussein] any ideas about this corner case, which could be related to an async edit logger bug? It has been reported many times by different people.

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316997#comment-17316997 ]

Max Xie edited comment on HDFS-15175 at 4/8/21, 9:01 AM:
---------------------------------------------------------

We encountered this bug on HDFS 3.2.1. Is there any progress? Ping [~hexiaoqiao] [~wanchang].

was (Author: max2049):
Ping [~hexiaoqiao] [~wanchang].

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
> Labels: NameNode
> Attachments: HDFS-15175-trunk.1.patch
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135675#comment-17135675 ]

Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:43 AM:
--------------------------------------------------------------

Added [~wanchang] as a contributor and assigned this issue to him. [~caiyicong], please feel free to assign it back if you are interested in it.

was (Author: hexiaoqiao):
Added [~wanchang] as a contributor and assigned this issue to him. [~caiyicong], please assign it back if you are interested in it.

> Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Wan Chang
> Priority: Critical
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17135675#comment-17135675 ] Xiaoqiao He edited comment on HDFS-15175 at 6/15/20, 9:42 AM: -- Add [~wanchang] as contributor and assign this issue to him. Please assign back if you are interested it.[~caiyicong] was (Author: hexiaoqiao): Add [~wanchang] as contributor and assign this issue to him. > Multiple CloseOp shared block instance causes the standby namenode to crash > when rolling editlog > > > Key: HDFS-15175 > URL: https://issues.apache.org/jira/browse/HDFS-15175 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Wan Chang >Priority: Critical > > > {panel:title=Crash exception} > 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log > tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp > [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, > atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], > permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, > clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, > txid=32625024993] > java.io.IOException: File is not under construction: .. 
> at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:360) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873) > at > org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479) > at > org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361) > {panel} > > {panel:title=Editlog} > > OP_REASSIGN_LEASE > > 32625021150 > DFSClient_NONMAPREDUCE_-969060727_197760 > .. > DFSClient_NONMAPREDUCE_1000868229_201260 > > > .. > > OP_CLOSE > > 32625023743 > 0 > 0 > .. > 3 > 1581816135883 > 1581814760398 > 536870912 > > > false > > 5568434562 > 185818644 > 4495417845 > > > da_music > hdfs > 416 > > > > .. > > OP_TRUNCATE > > 32625024049 > .. > DFSClient_NONMAPREDUCE_1000868229_201260 > .. > 185818644 > 1581816136336 > > 5568434562 > 185818648 > 4495417845 > > > > .. > > OP_CLOSE > > 32625024993 > 0 > 0 > .. 
> 3 > 1581816138774 > 1581814760398 > 536870912 > > > false > > 5568434562 > 185818644 > 4495417845 > > > da_music > hdfs > 416 > > > > {panel} > > > The block size should be 185818648 in the first CloseOp. When truncate is used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence is synchronized to the JournalNode in the same batch. Both CloseOps use the same block instance, which causes the first CloseOp to carry the wrong block size. When the SNN is rolling the edit log, the TruncateOp does not put the file into the UnderConstruction state. Then, when the second CloseOp is executed, the file is not in the UnderConstruction state, and the SNN crashes.
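The aliasing described in the report can be reproduced in isolation. The following is a minimal sketch, not Hadoop code: `Block`, `CloseOp`, and the `pending` list are hypothetical stand-ins for the NameNode's block object, the edit log op, and the batch waiting to be synced to the JournalNode. The only assumption is that an op keeps a reference to a mutable block and is serialized later than it is created.

```java
import java.util.ArrayList;
import java.util.List;

class SharedBlockDemo {
    /** Hypothetical stand-in for a mutable block: id plus current length. */
    static final class Block {
        final long id;
        long numBytes;
        Block(long id, long numBytes) { this.id = id; this.numBytes = numBytes; }
    }

    /** Hypothetical op that keeps a reference to the block, not a copy. */
    static final class CloseOp {
        final long txid;
        final Block block;
        CloseOp(long txid, Block block) { this.txid = txid; this.block = block; }
        /** Serialization reads the block length at flush time, not at creation time. */
        String serialize() { return "OP_CLOSE txid=" + txid + " numBytes=" + block.numBytes; }
    }

    /** Queue a close, truncate the block, queue a second close, then flush the batch. */
    static List<String> run() {
        Block block = new Block(5568434562L, 185818648L);
        List<CloseOp> pending = new ArrayList<>();
        pending.add(new CloseOp(32625023743L, block)); // first close, length is 185818648
        block.numBytes = 185818644L;                    // truncate mutates the same instance
        pending.add(new CloseOp(32625024993L, block)); // second close shares that instance
        List<String> flushed = new ArrayList<>();
        for (CloseOp op : pending) {
            flushed.add(op.serialize());                // the batch is written only now
        }
        return flushed;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Both serialized records report `numBytes=185818644`, which matches the two OP_CLOSE entries in the edit log excerpt: the first close has lost its original length because it was written after the truncate changed the shared block.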
[jira] [Comment Edited] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092532#comment-17092532 ] Wan Chang edited comment on HDFS-15175 at 4/26/20, 6:27 AM: We encountered this bug in our test environment. It seems that it could be fixed by deep-copying the blocks when generating the FSEditLogOp? was (Author: wanchang): We encountered this bug in our test environment. It seems that it could be fixed by deep-copying the blocks within FSEditLogAsync.logEdit(FSEditLogOp)? > Multiple CloseOp shared block instance causes the standby namenode to crash > when rolling editlog > > > Key: HDFS-15175 > URL: https://issues.apache.org/jira/browse/HDFS-15175 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.9.2 >Reporter: Yicong Cai >Assignee: Yicong Cai >Priority: Critical
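The deep-copy idea raised in the comment can be sketched as follows. This is an illustration under assumptions, not the actual patch: `Block`, `CloseOp`, and `copy()` are hypothetical names, and the real change may copy the blocks at a different point. The sketch only shows that an op which snapshots its blocks at creation time is unaffected by a later truncate on the file's own block array.

```java
class DeepCopyDemo {
    /** Hypothetical stand-in for a mutable block: id plus current length. */
    static final class Block {
        final long id;
        long numBytes;
        Block(long id, long numBytes) { this.id = id; this.numBytes = numBytes; }
        /** Return an independent snapshot of this block. */
        Block copy() { return new Block(id, numBytes); }
    }

    /** Hypothetical op that deep-copies the blocks when it is generated. */
    static final class CloseOp {
        final long txid;
        final Block[] blocks;
        CloseOp(long txid, Block[] source) {
            this.txid = txid;
            this.blocks = new Block[source.length];
            for (int i = 0; i < source.length; i++) {
                this.blocks[i] = source[i].copy(); // snapshot, no shared instance
            }
        }
        /** Length of the last block as recorded by this op. */
        long lastBlockBytes() { return blocks[blocks.length - 1].numBytes; }
    }

    /** Same close / truncate / close sequence, with copies taken per op. */
    static long[] run() {
        Block[] fileBlocks = { new Block(5568434562L, 185818648L) };
        CloseOp first = new CloseOp(32625023743L, fileBlocks);
        fileBlocks[0].numBytes = 185818644L; // truncate changes only the live block
        CloseOp second = new CloseOp(32625024993L, fileBlocks);
        return new long[] { first.lastBlockBytes(), second.lastBlockBytes() };
    }

    public static void main(String[] args) {
        long[] sizes = run();
        System.out.println("first close:  " + sizes[0]);
        System.out.println("second close: " + sizes[1]);
    }
}
```

With the snapshot in place the first close keeps 185818648 and the second records 185818644, so the ops stay correct even if they are flushed in one batch. The trade-off is one extra array and block allocation per logged op.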