Kevin Conaway created FLUME-2922:
------------------------------------
Summary: HDFSSequenceFile Should Sync Writer
Key: FLUME-2922
URL: https://issues.apache.org/jira/browse/FLUME-2922
Project: Flume
Issue Type: Bug
Components: Sinks+Sources
Affects Versions: v1.6.0
Reporter: Kevin Conaway
Priority: Critical
There is a possibility of losing data with the current HDFS sequence file
writer.
Internally, the `SequenceFile.Writer` buffers data and only periodically syncs it to
the underlying output stream. The exact mechanism depends on whether compression
is enabled, but in both scenarios the key/value pairs are appended to an internal
buffer and only written to the underlying stream once the buffer reaches a
certain size.
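For example, with block compression enabled, appended records sit entirely in the
writer's in-memory block buffer until it fills (roughly 1 MB by default via
`io.seqfile.compress.blocksize`, if I recall correctly). A hypothetical standalone
demo against the Hadoop 2.x API (the path and record below are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;

public class SequenceFileBufferingDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/tmp/buffering-demo.seq")),
        SequenceFile.Writer.keyClass(LongWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK))) {
      // This append lands in the writer's in-memory block buffer only.
      // Nothing reaches the output stream until the buffer fills or the
      // writer is sync()'d/closed; a crash at this point loses the record.
      writer.append(new LongWritable(1L),
          new BytesWritable("event body".getBytes("UTF-8")));
    }
  }
}
```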
Thus it is quite possible for Flume to lose messages if the agent crashes or is
stopped before the internal buffer has been flushed.
The correct action is to force the writer to sync its internal buffers to the
underlying `FSDataOutputStream` before calling `hflush`/`sync` on that stream.
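A minimal sketch of what that could look like in `HDFSSequenceFile.sync()` (the
`writer`, `outStream`, and `hflushOrSync()` members are assumed from the existing
Flume code; this is a sketch, not the final patch):

```java
@Override
public void sync() throws IOException {
  // First push the writer's buffered key/values (and, under block
  // compression, the current block) down to the FSDataOutputStream.
  writer.sync();
  // Then flush/sync the stream itself as before.
  hflushOrSync(outStream);
}
```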
Additionally, I believe we should be calling `hsync` instead of `hflush`. It's my
understanding that writes with `hsync` are more durable, which I believe is the
semantics we want here.
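To illustrate the distinction as I understand it (an assumption based on the HDFS
`Syncable` contract, not verified against every Hadoop version):

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

final class DurableFlush {
  private DurableFlush() {}

  static void makeDurable(FSDataOutputStream out) throws IOException {
    // out.hflush() only guarantees the data is visible to new readers;
    // it may still reside solely in DataNode memory.
    // out.hsync() additionally asks each DataNode to fsync the data to
    // disk, which is the stronger durability guarantee we want here.
    out.hsync();
  }
}
```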
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)