Reviewing the SequenceFile specification in more detail, it contains "sync markers" that denote block boundaries, so a sync operation that writes a partial block is probably not sensible. Closing and reopening the file would make the data visible in the file system, but still at a performance cost.
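For what it's worth, here is a rough sketch of what that close/reopen approach could look like. This is not SequenceFileBolt code; the class name, directory, and key/value types are made up for illustration, and it rotates to a new file on each "sync" because appending to an existing SequenceFile is awkward on Hadoop 2.6.0:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CloseAndRotate {
    private final Configuration conf = new Configuration();
    private final Path dir = new Path("/tmp/seqfile-out");  // placeholder directory
    private SequenceFile.Writer writer;
    private int part = 0;

    void open() throws IOException {
        Path next = new Path(dir, "part-" + (part++) + ".seq");
        writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(next),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
    }

    // close() flushes everything and updates the file length at the NameNode;
    // reopening a fresh file sidesteps the partial-block question entirely,
    // at the cost of an open/close round trip per sync.
    void syncByClosing() throws IOException {
        writer.close();
        open();
    }
}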
On 8/6/15, 8:05 AM, "Aaron.Dossett" <[email protected]> wrote:

>Hello,
>
>I see that when HDFSBolt syncs it takes advantage of the fact that it has
>a direct handle to an HdfsDataOutputStream with the following code:
>
>if (this.out instanceof HdfsDataOutputStream) {
>    ((HdfsDataOutputStream) this.out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
>} else {
>    this.out.hsync();
>}
>
>SequenceFileBolt, however, has a higher-level SequenceFile.Writer and so
>syncs like this:
>
>this.writer.hsync();
>
>From looking at the implementation of hsync in DFSOutputStream (Hadoop
>2.6.0) it seems that without passing SyncFlag.UPDATE_LENGTH there is no
>guarantee that namenode.fsync() gets called.
>
>Was that flag added to HDFSBolt to ensure that fsync() is called every
>time? When I sync my SequenceFileBolt I don't always see additional data
>written to the HDFS file, which I do see every single time with HDFSBolt
>syncs.
>
>It seems that to get the same behavior, which is what I want, I have to
>close the SequenceFile and then reopen. That seems like it will work,
>but at a performance cost.
>
>I would appreciate any feedback on my analysis above or proposed solution.
>
>Thanks!
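One alternative worth trying, if per-sync visibility is needed without closing the file: create the writer over an FSDataOutputStream you keep a handle to, and hsync that stream with UPDATE_LENGTH the same way HDFSBolt does. A rough sketch only, assuming the writer's hflush pushes its buffered bytes into the stream; the class and method names here are made up and this is not the actual SequenceFileBolt code:

import java.io.IOException;
import java.util.EnumSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class StreamBackedSequenceFile {
    private FSDataOutputStream out;      // kept so it can be hsync'ed directly
    private SequenceFile.Writer writer;

    void open(Configuration conf, Path path) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        out = fs.create(path);           // on HDFS this is an HdfsDataOutputStream
        writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.stream(out),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class));
    }

    void sync() throws IOException {
        writer.hflush();                 // push the writer's buffered data into the stream
        if (out instanceof HdfsDataOutputStream) {
            // same call HDFSBolt makes, so the NameNode also updates the visible length
            ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
        } else {
            out.hsync();
        }
    }
}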
