Reviewing the specification of SequenceFile in more detail, it contains
'sync markers' that denote block boundaries.  A sync operation that writes
a partial block is probably not sensible.  Closing/reopening would write
the data to the file system, though at a performance cost.
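
For reference, here is a rough, untested sketch (against the Hadoop 2.6
client API) of what a direct handle to the underlying stream would look
like for a SequenceFile: create the FSDataOutputStream yourself, hand it
to the writer via SequenceFile.Writer.stream(), and keep the reference so
the UPDATE_LENGTH hsync can be issued on it the same way HDFSBolt does.
The class name, path, and key/value types below are placeholders, not
Storm code:

import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSyncSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/sync-sketch.seq");  // placeholder path

        // Create the stream ourselves and keep the reference.
        FSDataOutputStream out = fs.create(path);
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.stream(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.NONE));

        writer.append(new LongWritable(1L), new Text("record"));
        writer.hflush();  // push the writer's output down to the stream

        // Same pattern as HDFSBolt: ask the NameNode to update the visible
        // file length so readers see the flushed data.
        if (out instanceof HdfsDataOutputStream) {
            ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        } else {
            out.hsync();
        }

        writer.close();
        out.close();  // close the stream as well, since it was created here
    }
}

That avoids closing the file on every sync, at the cost of wiring up the
stream creation yourself; whether it is worth it given the sync-marker
semantics above is a separate question.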

On 8/6/15, 8:05 AM, "Aaron.Dossett" <[email protected]> wrote:

>Hello,
>
>I see that when HDFSBolt syncs it takes advantage of the fact that it has
>a direct handle to an HdfsDataOutputStream with the following code:
>
>
>if (this.out instanceof HdfsDataOutputStream) {
>    ((HdfsDataOutputStream) this.out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
>} else {
>    this.out.hsync();
>}
>
>SequenceFileBolt, however, has a higher-level SequenceFile.Writer and so
>syncs like this:
>
>this.writer.hsync();
>
>
>From looking at the implementation of hsync in DFSOutputStream (Hadoop
>2.6.0) it seems that without passing SyncFlag.UPDATE_LENGTH there is no
>guarantee that namenode.fsync() gets called.
>
>Was that flag added to HDFSBolt to ensure that fsync() is called every
>time?  When I sync my SequenceFileBolt I don't always see additional data
>written to the HDFS file, which I do see every single time with HDFSBolt
>syncs.
>
>It seems that to get the same behavior, which is what I want, I have to
>close the SequenceFile and then reopen.  That seems like it will work,
>but at a performance cost.
>
>I would appreciate any feedback on my analysis above or proposed solution.
>
>
>Thanks!
