Hi,
I am working with the 2.0.2-alpha version of Hadoop. I am currently writing my
key-value pairs to a SequenceFile on HDFS. I flush my data regularly using
hsync() because the process writing to the file can terminate abruptly. My
requirement is that once a hsync() succeeds, the data written before that
hsync() must remain available.
To verify this, I ran a test that killed the process (which was writing to a
SequenceFile) right after it had called hsync(). When I then read the data
using the "hadoop fs -cat" command, I can see the data, but the file size is
reported as 0, and SequenceFile.Reader.next(key, value) returns false. I have
read that because the file was not closed properly, its size was never updated
on the NameNode, and that this is also why next() returns false.
To fix this and make the file readable through the SequenceFile APIs, I opened
the file in append mode and then closed the stream immediately; as far as I
understand, the append triggers lease recovery, after which the NameNode
records the correct file length. While doing this, I retry if I receive a
RecoveryInProgressException or an AlreadyBeingCreatedException. Now I can
successfully read the data using SequenceFile.Reader. The code I am using
follows.
*** WRITE THREAD ***
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path,
    Text.class, Text.class, CompressionType.NONE);
writer.append(new Text("India"), new Text("Delhi"));
writer.append(new Text("China"), new Text("Beijing"));
writer.hsync();
// BOOM, FAILURE, PROCESS TERMINATED
*** I expect India and China to be readable at this point, but next() returns false ***
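For reference, this is roughly how I read the data back (a minimal sketch;
fs, conf, and path are the same as in the write thread, and Text keys and
values are assumed):

*** READ THREAD (sketch) ***
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
Text key = new Text();
Text value = new Text();
// next() should return true for India and China, but until the file
// size is fixed it returns false straight away
while (reader.next(key, value)) {
    System.out.println(key + " -> " + value);
}
reader.close();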
*** Code to fix the file size ***
while (true) {
    try {
        FileSystem fs = FileSystem.get(namenodeURI, conf);
        Path path = new Path(uri);
        FSDataOutputStream out = fs.append(path);
        out.close();  // closing the appended stream updates the file length
        break;
    } catch (RecoveryInProgressException e) {
        // block recovery already under way, retry
    } catch (AlreadyBeingCreatedException e) {
        // previous lease not yet recovered, retry
    } catch (Exception e) {
        break;
    }
}
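One alternative I came across, though I have not verified that it is available
or reliable on 2.0.2-alpha, is to ask the NameNode to recover the lease
directly via DistributedFileSystem.recoverLease() instead of doing an append
and close. A rough sketch of what I mean:

*** Alternative: explicit lease recovery (sketch, unverified) ***
DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(namenodeURI, conf);
Path path = new Path(uri);
// recoverLease() returns true once the file is closed; poll until then
while (!dfs.recoverLease(path)) {
    Thread.sleep(1000);  // real code would handle InterruptedException
}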
Could you let me know whether this approach has any shortcomings, or whether
there are better alternatives?
Thanks,
Hemant Bhanawat