[
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16809427#comment-16809427
]
Wei-Chiu Chuang commented on HDFS-13596:
----------------------------------------
I am reviewing the patch.
{quote}This issue occurs because namenode writes new layout audit log during
upgrading ,but standby namenode can not parse new layout audit log. So We can
writes audit log according to the current layout version.
{quote}
I'm pretty sure you meant to say "edit log" instead of "audit log".
I think not being able to accept EC requests prior to the completion of the
upgrade is a reasonable trade-off. You can check the layout version within
{{FSNamesystem#startFileInt}} and reject the request when
CreateFlag.SHOULD_REPLICATE is false or ecPolicyName is not empty. Other EC
RPCs that should be checked include setErasureCodingPolicy.
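A rough sketch of the check I have in mind, written as a standalone model
rather than against the real FSNamesystem code. Everything here
(EcUpgradeGuard, Flag, EC_LAYOUT_VERSION and its value, the exception type) is
a hypothetical stand-in. It assumes layout versions are negative integers that
decrease as features are added, so "supports EC" means "effective version <=
the version that introduced EC".

```java
import java.util.EnumSet;

/** Hypothetical model of the suggested guard; none of these names are Hadoop APIs. */
final class EcUpgradeGuard {
    /** Stand-in for the client's create flags. */
    enum Flag { CREATE, OVERWRITE, SHOULD_REPLICATE }

    /** Assumed layout version that introduced erasure coding (illustrative value). */
    static final int EC_LAYOUT_VERSION = -64;

    /** Assumption: versions are negative and decrease as features are added. */
    static boolean supportsEc(int effectiveLayoutVersion) {
        return effectiveLayoutVersion <= EC_LAYOUT_VERSION;
    }

    /**
     * Rejects a create request while the effective layout version predates EC,
     * using the condition as worded above: SHOULD_REPLICATE is not set, or an
     * EC policy name was supplied.
     */
    static void checkCreate(int effectiveLayoutVersion, EnumSet<Flag> flags,
                            String ecPolicyName) {
        boolean policyGiven = ecPolicyName != null && !ecPolicyName.isEmpty();
        boolean mayUseEc = !flags.contains(Flag.SHOULD_REPLICATE) || policyGiven;
        if (mayUseEc && !supportsEc(effectiveLayoutVersion)) {
            throw new UnsupportedOperationException(
                "Erasure coding is not available until the upgrade completes");
        }
    }

    /** The same guard for an RPC that is always EC-specific. */
    static void checkSetErasureCodingPolicy(int effectiveLayoutVersion) {
        if (!supportsEc(effectiveLayoutVersion)) {
            throw new UnsupportedOperationException(
                "Erasure coding is not available until the upgrade completes");
        }
    }
}
```

In the real patch the effective layout version would come from the namesystem
state during the rolling upgrade; here it is just passed in as a parameter.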
The patch has no tests, but you have done the manual test, so that's OK.
Further notes: if we want to support rolling upgrades, we should define the
"to" and "from" versions supported. I've not done any rolling upgrade tests
myself. [~ferhui] what "to" and "from" versions do you have? I think if we can
support 2.8 it'll make most of the community happy.
Release note or documentation, please.
Things that should be documented – minimum supported versions, caveats,
> NN restart fails after RollingUpgrade from 2.x to 3.x
> -----------------------------------------------------
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Reporter: Hanisha Koneru
> Assignee: Fei Hui
> Priority: Critical
> Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch,
> HDFS-13596.003.patch
>
>
> After rollingUpgrade NN from 2.x to 3.x, if the NN is restarted, it fails
> while replaying edit logs.
> * After NN is started with rollingUpgrade, the layoutVersion written to
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version
> (so as to support downgrade).
> * When writing transactions to log, NN writes as per the current layout
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
> * So any edit log written after the upgrade and before finalizing the
> upgrade will have the old layout version but the new format of transactions.
> * When NN is restarted and the edit logs are replayed, the NN reads the old
> layout version from the editLog file. When parsing the transactions, it
> assumes that the transactions are also from the previous layout and hence
> skips parsing the erasureCoding bits.
> * This cascades into reading the wrong set of bits for other fields and
> leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected length 16
> at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
> at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
> at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
> at org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
> at org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
> at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
> at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
> {code}
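
For anyone following along, the mechanism described in the issue above (old
layout version in the header, new-format transactions in the body) can be
shown with a toy record format. This is only an illustration: the record
layout, the class LayoutMismatchDemo, and the version constants are invented
for the example and are not the real edit log encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Toy model of a reader that trusts the header version; not Hadoop code. */
final class LayoutMismatchDemo {
    static final int OLD_LAYOUT = -63; // illustrative pre-upgrade version
    static final int NEW_LAYOUT = -64; // illustrative version that adds the EC field

    /** Toy record: [ecPolicyId byte, new format only][clientId length][clientId]. */
    static byte[] writeRecord(boolean newFormat, byte ecPolicyId, byte[] clientId) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            if (newFormat) {
                out.writeByte(ecPolicyId);
            }
            out.writeByte(clientId.length);
            out.write(clientId);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the clientId length, deciding from the header version whether to skip the EC byte. */
    static int readClientIdLength(int headerLayoutVersion, byte[] record) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            if (headerLayoutVersion <= NEW_LAYOUT) {
                in.readByte(); // the EC field is only expected for the new layout
            }
            return in.readUnsignedByte();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a new-format record holding a 16-byte clientId and an ecPolicyId of 0,
readClientIdLength(NEW_LAYOUT, record) returns 16, while
readClientIdLength(OLD_LAYOUT, record) returns 0 because the EC byte is read
as the length. Every later field is then shifted as well, which is the
cascade the description refers to.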