Harsh J created HDFS-4936:
-----------------------------
Summary: Handle overflow condition for txid going over Long.MAX_VALUE
Key: HDFS-4936
URL: https://issues.apache.org/jira/browse/HDFS-4936
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.0.0-alpha
Reporter: Harsh J
Priority: Minor
Hat tip to [[email protected]] for the question (on the mailing lists) that led to this.
I manually hacked up my local NN's txids to go very large (close to the max) to see whether this causes any harm. I bumped the freshly formatted files' starting txid to 9223372036854775805 (and ensured the image references the same by hex-editing it):
{code}
➜ current ls
VERSION
fsimage_9223372036854775805.md5
fsimage_9223372036854775805
seen_txid
➜ current cat seen_txid
9223372036854775805
{code}
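For context, 9223372036854775805 is Long.MAX_VALUE - 2, so only a couple more transaction ids can be allocated before the counter wraps. A minimal Java sketch (illustrative only, not HDFS code) of that headroom:
{code}
// Minimal sketch (not HDFS code): the chosen starting txid leaves just two
// ids of headroom before the signed 64-bit maximum.
public class TxidHeadroom {
  public static void main(String[] args) {
    long startTxid = 9223372036854775805L;
    System.out.println(Long.MAX_VALUE);             // 9223372036854775807
    System.out.println(Long.MAX_VALUE - startTxid); // 2
  }
}
{code}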
NameNode started up as expected.
{code}
13/06/25 18:30:08 INFO namenode.FSImage: Image file of size 129 loaded in 0 seconds.
13/06/25 18:30:08 INFO namenode.FSImage: Loaded image for txid 9223372036854775805 from /temp-space/tmp-default/dfs-cdh4/name/current/fsimage_9223372036854775805
13/06/25 18:30:08 INFO namenode.FSEditLog: Starting log segment at 9223372036854775806
{code}
I could then create a bunch of files and do regular ops, with the txid counting up past Long.MAX_VALUE. I created over 10 files, just to push it well beyond Long.MAX_VALUE.
Quitting the NameNode and restarting it then fails, with the following error:
{code}
13/06/25 18:31:08 INFO namenode.FileJournalManager: Recovering unfinalized segments in /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current
13/06/25 18:31:08 INFO namenode.FileJournalManager: Finalizing edits file /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_inprogress_9223372036854775806 -> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_9223372036854775806-9223372036854775807
13/06/25 18:31:08 FATAL namenode.NameNode: Exception in namenode join
java.io.IOException: Gap in transactions. Expected to be able to read up until at least txid 9223372036854775806 but unable to find any edit logs containing txid -9223372036854775808
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkForGaps(FSEditLog.java:1194)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1152)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:616)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:267)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:590)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
{code}
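The negative txid in the exception is simply Long.MIN_VALUE: once the txid counter passes Long.MAX_VALUE it silently wraps around. A small Java sketch (illustrative only, not HDFS code) of the arithmetic:
{code}
// Illustrative only (not HDFS code): incrementing a long past Long.MAX_VALUE
// wraps to Long.MIN_VALUE, the -9223372036854775808 reported above.
public class TxidWraparound {
  public static void main(String[] args) {
    long txid = Long.MAX_VALUE; // 9223372036854775807
    txid++;
    System.out.println(txid);   // -9223372036854775808
  }
}
{code}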
It also looks like we lose some edits when we restart, as indicated by the range in the finalized edits file name:
{code}
VERSION
edits_9223372036854775806-9223372036854775807
fsimage_9223372036854775805
fsimage_9223372036854775805.md5
seen_txid
{code}
It seems we won't be able to handle the case where the txid overflows. It's a very, very large number, so this is not an immediate concern, but it seemed worth a report.
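One possible direction, sketched below purely as an illustration (the method is hypothetical, not an existing FSEditLog API), would be to fail fast before the counter wraps rather than handing out negative txids:
{code}
// Hypothetical guard, not existing HDFS code: refuse to allocate a txid
// once the id space is exhausted instead of silently wrapping negative.
static long nextTxid(long currentTxid) {
  if (currentTxid == Long.MAX_VALUE) {
    throw new IllegalStateException(
        "Transaction id space exhausted at txid " + currentTxid);
  }
  return currentTxid + 1;
}
{code}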
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira