[ https://issues.apache.org/jira/browse/HDFS-4936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fengdong Yu updated HDFS-4936:
------------------------------
Description:
Hat tip to [[email protected]] for the question (on the mailing lists) that led to this.
I manually hacked my local NN's txids to be very large (close to the max) and decided to test whether this causes any harm. I basically bumped the freshly formatted files' starting txid to 9223372036854775805 (and ensured the image references the same by hex-editing it):
{code}
➜ current ls
VERSION
fsimage_9223372036854775805.md5
fsimage_9223372036854775805
seen_txid
➜ current cat seen_txid
9223372036854775805
{code}
NameNode started up as expected.
{code}
13/06/25 18:30:08 INFO namenode.FSImage: Image file of size 129 loaded in 0 seconds.
13/06/25 18:30:08 INFO namenode.FSImage: Loaded image for txid 9223372036854775805 from /temp-space/tmp-default/dfs-cdh4/name/current/fsimage_9223372036854775805
13/06/25 18:30:08 INFO namenode.FSEditLog: Starting log segment at 9223372036854775806
{code}
I could create a bunch of files and do regular ops (with the txid counter incrementing well past the long max). I created over 10 files, just to push it well over Long.MAX_VALUE.
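For context, the negative txid that shows up in the error below is just Java's signed 64-bit wrap-around. A minimal standalone sketch (plain Java, not HDFS code) of that behaviour:
{code}
// Incrementing a long past Long.MAX_VALUE silently wraps to Long.MIN_VALUE.
public class TxidOverflowDemo {
  public static void main(String[] args) {
    long txid = 9223372036854775805L;  // the hand-edited starting txid
    for (int i = 0; i < 3; i++) {
      txid++;                          // each committed edit bumps the txid by one
      System.out.println(txid);
    }
    // Prints:
    //  9223372036854775806
    //  9223372036854775807   (Long.MAX_VALUE)
    // -9223372036854775808   (Long.MIN_VALUE, the txid in the error below)
  }
}
{code}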
Quitting the NameNode and restarting it fails though, with the following error:
{code}
13/06/25 18:31:08 INFO namenode.FileJournalManager: Recovering unfinalized segments in /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current
13/06/25 18:31:08 INFO namenode.FileJournalManager: Finalizing edits file /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_inprogress_9223372036854775806 -> /Users/harshchouraria/Work/installs/temp-space/tmp-default/dfs-cdh4/name/current/edits_9223372036854775806-9223372036854775807
13/06/25 18:31:08 FATAL namenode.NameNode: Exception in namenode join
java.io.IOException: Gap in transactions. Expected to be able to read up until at least txid 9223372036854775806 but unable to find any edit logs containing txid -9223372036854775808
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.checkForGaps(FSEditLog.java:1194)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1152)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:616)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:267)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:592)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:435)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:397)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:399)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:433)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:590)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1141)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1205)
{code}
It looks like we also lose some edits when we restart, as the finalized edits filename shows:
{code}
VERSION
edits_9223372036854775806-9223372036854775807
fsimage_9223372036854775805
fsimage_9223372036854775805.md5
seen_txid
{code}
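A quick sanity check on that filename's range (plain Java arithmetic, not HDFS code) shows the finalized segment can only account for two transactions, even though more than ten files were created:
{code}
// The segment edits_9223372036854775806-9223372036854775807 covers an
// inclusive txid range of exactly two transactions; anything logged after the
// counter wrapped falls outside any representable (non-negative) range.
long firstTxId = 9223372036854775806L;
long lastTxId  = 9223372036854775807L;        // Long.MAX_VALUE
System.out.println(lastTxId - firstTxId + 1); // prints 2
{code}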
It seems we won't be able to handle the case where the txid overflows. It's a very, very large number, so that's not an immediate concern, but it seemed worthy of a report.
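One possible direction, offered purely as a sketch and not as any existing FSEditLog API, would be to fail fast before the counter can wrap, via a hypothetical guard at txid-allocation time:
{code}
// Hypothetical guard (names and placement are illustrative, not actual HDFS
// code): refuse to hand out a txid that would overflow a signed 64-bit long.
private long nextTxId(long currentTxId) throws IOException {
  if (currentTxId == Long.MAX_VALUE) {
    throw new IOException("Transaction ID has reached Long.MAX_VALUE; "
        + "refusing to allocate further transaction IDs");
  }
  return currentTxId + 1;
}
{code}
Whether aborting like this is the right behaviour is, of course, part of what handling the overflow would need to decide.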
> Handle overflow condition for txid going over Long.MAX_VALUE
> ------------------------------------------------------------
>
> Key: HDFS-4936
> URL: https://issues.apache.org/jira/browse/HDFS-4936
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.0-alpha
> Reporter: Harsh J
> Priority: Minor
>