Hello,
I see that when HDFSBolt syncs it takes advantage of the fact that it has a
direct handle to an HdfsDataOutputStream with the following code:
    if (this.out instanceof HdfsDataOutputStream) {
        ((HdfsDataOutputStream) this.out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
    } else {
        this.out.hsync();
    }
SequenceFileBolt, however, only has a higher-level SequenceFile.Writer and so
syncs like this:
    this.writer.hsync();
From looking at the implementation of hsync in DFSOutputStream (Hadoop 2.6.0),
it seems that without passing SyncFlag.UPDATE_LENGTH there is no guarantee
that namenode.fsync() gets called.
Was that flag added to HDFSBolt to ensure that fsync() is called every time?
When I sync my SequenceFileBolt I don’t always see additional data written to
the HDFS file, which I do see every single time with HDFSBolt syncs.
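To make my reading concrete, here is a toy model in plain Java. It is not
Hadoop code: ToyNameNode, ToyStream, and the block-size bookkeeping are names I
made up for illustration. It assumes the namenode is told the file length only
when a new block has been started or when UPDATE_LENGTH is passed, which is my
understanding of the 2.6.0 code and may well be wrong.

```java
import java.util.EnumSet;

class HsyncModel {
    // Stand-in for Hadoop's SyncFlag enum; only the flag discussed here.
    enum SyncFlag { UPDATE_LENGTH }

    // Hypothetical namenode: remembers only the last length it was told about.
    static class ToyNameNode {
        long visibleLength = 0;
        int fsyncCalls = 0;

        void fsync(long length) {
            visibleLength = length;
            fsyncCalls++;
        }
    }

    // Hypothetical output stream modelling my assumption about hsync().
    static class ToyStream {
        private final ToyNameNode nameNode;
        private final long blockSize;
        private long written = 0;
        // True while a block exists that the namenode has not been told about.
        private boolean newBlockPending = true;

        ToyStream(ToyNameNode nameNode, long blockSize) {
            this.nameNode = nameNode;
            this.blockSize = blockSize;
        }

        void write(long bytes) {
            long blockBefore = written / blockSize;
            written += bytes;
            if (written / blockSize != blockBefore) {
                newBlockPending = true; // crossed into a new block
            }
        }

        void hsync(EnumSet<SyncFlag> flags) {
            // Data reaches the datanodes either way (not modelled here).
            boolean updateLength = flags.contains(SyncFlag.UPDATE_LENGTH);
            if (newBlockPending || updateLength) {
                nameNode.fsync(written);
                newBlockPending = false;
            }
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        ToyStream out = new ToyStream(nn, 100);

        out.write(10);
        out.hsync(EnumSet.noneOf(SyncFlag.class));
        System.out.println(nn.visibleLength); // 10: first block gets reported

        out.write(10);
        out.hsync(EnumSet.noneOf(SyncFlag.class));
        System.out.println(nn.visibleLength); // 10: length is stale without the flag

        out.hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        System.out.println(nn.visibleLength); // 20: the flag forces the update
    }
}
```

In this model the plain hsync() only changes the visible length now and then
(when a block boundary is crossed), which would match what I see with
SequenceFileBolt, while the UPDATE_LENGTH variant changes it on every call.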
It seems that to get the same behavior, which is what I want, I have to close
the SequenceFile and then reopen it on every sync. That should work, but at a
performance cost.
I would appreciate any feedback on my analysis above or proposed solution.
Thanks!