[
https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323202#comment-15323202
]
ASF GitHub Bot commented on FLUME-2922:
---------------------------------------
GitHub user kevinconaway opened a pull request:
https://github.com/apache/flume/pull/52
FLUME-2922 Sync SequenceFile.Writer before calling hflush
@harishreedharan will you please review?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/kevinconaway/flume flume-2922
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flume/pull/52.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #52
----
commit f03a2406bf44a8300522c1941293e2d74df88d28
Author: Kevin Conaway <[email protected]>
Date: 2016-06-09T19:50:13Z
FLUME-2922 Sync SequenceFile.Writer before calling hflush
----
> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>
> Key: FLUME-2922
> URL: https://issues.apache.org/jira/browse/FLUME-2922
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.6.0
> Reporter: Kevin Conaway
> Priority: Critical
>
> There is a possibility of losing data with the current HDFS sequence file
> writer.
> Internally, the `SequenceFile.Writer` buffers data and periodically syncs it
> to the underlying output stream. The mechanism depends on whether
> compression is enabled, but in both cases the key/value pairs are appended
> to an internal buffer and flushed to disk only after the buffer reaches a
> certain size.
> Thus it is quite possible for Flume to lose messages if the agent crashes, or
> is stopped, before the internal buffer is flushed to disk.
> The correct action is to force the writer to sync its internal buffers to the
> underlying `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush. It's my
> understanding that writes with hsync are more durable, which I believe are
> the semantics we want here.
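
The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch using the Java standard library, not Flume or Hadoop code: `BufferedRecordWriter` is a hypothetical stand-in for a writer that buffers records internally, a `ByteArrayOutputStream` stands in for the underlying `FSDataOutputStream`, and the 1024-byte threshold is an arbitrary assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a writer that buffers records before writing them out. */
public class BufferedRecordWriter {
    private final OutputStream out; // stands in for the underlying output stream
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final int threshold;    // assumed buffer size that triggers a write

    public BufferedRecordWriter(OutputStream out, int threshold) {
        this.out = out;
        this.threshold = threshold;
    }

    /** Appends a record; it reaches the stream only once the buffer hits the threshold. */
    public void append(String record) throws IOException {
        buffer.write(record.getBytes(StandardCharsets.UTF_8));
        if (buffer.size() >= threshold) {
            sync();
        }
    }

    /** Forces all buffered records into the underlying stream. */
    public void sync() throws IOException {
        buffer.writeTo(out);
        buffer.reset();
    }

    /** Flushes only the stream; records still held in the internal buffer stay behind. */
    public void flushStreamOnly() throws IOException {
        out.flush();
    }

    /** The proposed order: drain the writer's buffer first, then flush the stream. */
    public void syncThenFlush() throws IOException {
        sync();
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        BufferedRecordWriter writer = new BufferedRecordWriter(stream, 1024);

        writer.append("event-1");
        writer.flushStreamOnly();
        System.out.println("after stream flush only: " + stream.size() + " bytes"); // 0 bytes

        writer.syncThenFlush();
        System.out.println("after sync then flush: " + stream.size() + " bytes");   // 7 bytes
    }
}
```

In this model, a crash between `flushStreamOnly()` and the next threshold-triggered write would lose `event-1`, which is the failure mode the issue describes. Whether hflush or hsync is the right second step is a separate durability question that this sketch does not model.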
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)