Joydeep Sen Sarma wrote:
> i don't see a previous dir right now. i am pretty sure it wasn't there earlier
> as well.
>
> i can send u the ls output offline - but there's nothing in it other than tons of 'subdir*'

I am just checking whether there was an ongoing upgrade at that time.
If not then there is no recovery.

> we know how it happened (disk was full due to separate bug. restart could not
> flush VERSION and it went missing,

Yes, this is exactly what HADOOP-2073 fixed.

> subsequent restarts failed). what kind of automatic recovery were u expecting? (perhaps there's some option we should be setting, but are not).

I can only recommend to use upgrade in suspicious cases, even if you are not
actually upgrading the software.
The upgrade creates a "snapshot" so that you could rollback if something goes
wrong during the startup.

You did the right thing with recovering the version file.
Thanks,
--Konstantin


-----Original Message-----
From: Konstantin Shvachko [mailto:[EMAIL PROTECTED]
Sent: Tue 1/8/2008 10:56 AM
To: hadoop-user@lucene.apache.org
Subject: Re: missing VERSION files leading to failed datanodes
Joydeep,

Do you still have the previous directory? It should be
/var/hadoop/tmp/dfs/data/previous

If you do you can use VERSION file from there.
If not could you please do ls -R /var/hadoop/tmp/dfs/data for me.
Block files are not needed of course.
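[For reference, that recovery step could be sketched as below. This is a hypothetical Python helper, not part of Hadoop; it assumes the upgrade snapshot layout keeps a copy of VERSION under previous/, alongside current/, as the paths in this thread suggest.]

```python
import os
import shutil

def restore_version_from_previous(storage_dir):
    """If an upgrade snapshot exists (previous/VERSION), copy its
    VERSION file back into current/. Returns True on success,
    False when no snapshot copy is available."""
    src = os.path.join(storage_dir, "previous", "VERSION")
    dst = os.path.join(storage_dir, "current", "VERSION")
    if not os.path.isfile(src):
        return False
    shutil.copyfile(src, dst)
    return True
```

As the thread notes, this only helps if an upgrade was in progress when the file was lost; otherwise there is no previous/ directory to restore from.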

In any case I am interested in how it happened and why automatic recovery is 
not happening.
Do you have any log messages from the time the data-node first failed?
Was it upgrading at that time?
Any information would be useful.

Thank you,
--Konstantin


Joydeep Sen Sarma wrote:

we are running 0.14.4

the fix won't help me recover the current version files. all i need is the 
storageid. it seems to be stored in some file header somewhere. can u tell me 
how to get it?
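[The stack trace later in this thread shows VERSION being written via java.util.Properties, i.e. plain key=value lines. So if any node still has an intact copy, a sketch like this could pull the ID out; the storageID key name matches the "storageid" asked about above, but treat the exact field names as an assumption.]

```python
def read_storage_id(version_path):
    """Parse a Java-Properties-style VERSION file (key=value lines,
    '#' comment lines) and return the storageID value, or None."""
    with open(version_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and the Properties timestamp comment
            key, _, value = line.partition("=")
            if key.strip() == "storageID":
                return value.strip()
    return None
```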


-----Original Message-----
From: dhruba Borthakur [mailto:[EMAIL PROTECTED]
Sent: Tue 1/8/2008 10:06 AM
To: hadoop-user@lucene.apache.org
Subject: RE: missing VERSION files leading to failed datanodes

Hi Joydeep,

Which version of hadoop are you running? We had earlier fixed a bug
https://issues.apache.org/jira/browse/HADOOP-2073
in version 0.15.

Thanks,
dhruba

-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 08, 2008 9:34 AM
To: hadoop-user@lucene.apache.org; hadoop-user@lucene.apache.org
Subject: RE: missing VERSION files leading to failed datanodes

well - at least i know why this happened. (still looking for a way to
restore the version file).

https://issues.apache.org/jira/browse/HADOOP-2549 is causing disk full
on one of the disks (in spite of du.reserved setting). looks like while
starting up - the VERSION file could not be written and went missing.
that would seem like another bug (writing a tmp file and renaming it to
VERSION file would have prevented this mishap):

2008-01-08 08:24:01,597 ERROR org.apache.hadoop.dfs.DataNode:
java.io.IOException: No space left on device
       at java.io.FileOutputStream.writeBytes(Native Method)
       at java.io.FileOutputStream.write(FileOutputStream.java:260)
       at sun.nio.cs.StreamEncoder$CharsetSE.writeBytes(StreamEncoder.java:336)
       at sun.nio.cs.StreamEncoder$CharsetSE.implFlushBuffer(StreamEncoder.java:404)
       at sun.nio.cs.StreamEncoder$CharsetSE.implFlush(StreamEncoder.java:408)
       at sun.nio.cs.StreamEncoder.flush(StreamEncoder.java:152)
       at java.io.OutputStreamWriter.flush(OutputStreamWriter.java:213)
       at java.io.BufferedWriter.flush(BufferedWriter.java:236)
       at java.util.Properties.store(Properties.java:666)
       at org.apache.hadoop.dfs.Storage$StorageDirectory.write(Storage.java:176)
       at org.apache.hadoop.dfs.Storage$StorageDirectory.write(Storage.java:164)
       at org.apache.hadoop.dfs.Storage.writeAll(Storage.java:510)
       at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:146)
       at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:243)
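[The write-tmp-then-rename fix suggested above can be sketched as follows. A hypothetical Python illustration, not the actual Hadoop code (which is Java): the point is that a full disk makes the write of the temp file fail, while the old VERSION stays intact, because the rename only happens after a successful flush.]

```python
import os
import tempfile

def write_version_atomically(dirname, contents):
    """Write a VERSION-like file so that a crash or full disk never
    leaves a truncated/empty file: write to a temp file in the same
    directory, flush and fsync it, then rename over the final name
    (rename within a filesystem is atomic on POSIX)."""
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix="VERSION.tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())  # this is where "No space left on device"
                                  # would surface, before VERSION is touched
        os.replace(tmp_path, os.path.join(dirname, "VERSION"))
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up only if the rename never happened
```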


-----Original Message-----
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Tue 1/8/2008 8:51 AM
To: hadoop-user@lucene.apache.org
Subject: missing VERSION files leading to failed datanodes


2008-01-08 08:36:20,045 ERROR org.apache.hadoop.dfs.DataNode:
org.apache.hadoop.dfs.InconsistentFSStateException: Directory
/var/hadoop/tmp/dfs/data is in an inconsistent state: file VERSION is
invalid.

[EMAIL PROTECTED] data]# ssh hadoop003.sf2p cat /var/hadoop/tmp/dfs/data/current/VERSION
[EMAIL PROTECTED] data]#

any idea why the VERSION file is empty? and how can i regenerate it - or
ask the system to generate a new one without discarding all the blocks?


i had previously shutdown and started dfs once (to debug a different bug
where it's not honoring du.reserved. more on that later).

help appreciated,

Joydeep





