[
https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375690#comment-15375690
]
Mike Percy commented on FLUME-2922:
-----------------------------------
[~hshreedharan], yeah, you're right, everything is pretty manual in my process
at this point: download the PR patch, `git am foo.patch`, `git rebase` to squash
and edit the commit message, etc. Do you have a link to the script Spark uses?
[~kevinconaway], sorry, didn't mean to confuse you; the whole PR process is
probably a discussion we should start in its own thread on the dev list :)
Anyway, feel free to use the PR for this issue for now, since I don't mind
using it as a reviewer this time.
> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>
> Key: FLUME-2922
> URL: https://issues.apache.org/jira/browse/FLUME-2922
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.6.0
> Reporter: Kevin Conaway
> Priority: Critical
> Attachments: FLUME-2922.patch
>
>
> There is a possibility of losing data with the current HDFS sequence file
> writer.
> Internally, the `SequenceFile.Writer` buffers data and periodically syncs it
> to the underlying output stream. The exact mechanism depends on whether
> compression is used, but in both cases the key/values are appended to an
> internal buffer and only flushed once the buffer reaches a certain size.
> Thus it is quite possible for Flume to lose messages if the agent crashes or
> is stopped before the internal buffer is flushed to disk.
> The correct fix is to force the writer to sync its internal buffers to the
> underlying `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush. It's my
> understanding that hsync gives stronger durability guarantees (the data is
> synced to disk on the datanodes rather than merely flushed to them), which I
> believe are the semantics we want here.
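For reference, below is a minimal sketch of the fix the description calls for,
assuming the sink holds both the `SequenceFile.Writer` and the underlying
`FSDataOutputStream`. The class, field, and method names are illustrative
stand-ins for Flume's `HDFSSequenceFile` internals, not the attached
FLUME-2922.patch:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.SequenceFile;

// Illustrative sketch only; names approximate Flume's HDFSSequenceFile
// rather than reproducing the attached patch.
public class SequenceFileSyncSketch {

  private final SequenceFile.Writer writer;   // buffers key/values internally
  private final FSDataOutputStream outStream; // underlying HDFS stream

  public SequenceFileSyncSketch(SequenceFile.Writer writer,
                                FSDataOutputStream outStream) {
    this.writer = writer;
    this.outStream = outStream;
  }

  public void sync() throws IOException {
    // Ask the writer to sync its internal buffers to the stream first;
    // with block compression this writes out the buffered block. Skipping
    // this step leaves recently appended events in the writer's buffer,
    // which is the data-loss window described above.
    writer.sync();
    // Then make the stream durable: hsync() syncs the data to disk on the
    // datanodes, whereas hflush() only guarantees visibility to new readers.
    outStream.hsync();
  }
}
```

The ordering matters: hsync only persists bytes that have already reached the
stream, so syncing the writer first is what actually closes the window.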