[ 
https://issues.apache.org/jira/browse/FLUME-2922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375637#comment-15375637
 ] 

Mike Percy commented on FLUME-2922:
-----------------------------------

[~kevinconaway], this looks like a good change. I wouldn't mind committing 
this. However, I left some comments for you on the pull request regarding the 
test.

(Regarding pull requests: I know we were told a long time ago not to use pull 
requests, but I've seen other ASF projects do it recently (such as Spark) and 
they are pretty convenient for code reviews... The biggest downside IMHO is 
that I don't want to apply merge commits, because they make the commit log hard 
to read, so I have to manually squash everything on commit. But I don't mind 
doing that.)

> HDFSSequenceFile Should Sync Writer
> -----------------------------------
>
>                 Key: FLUME-2922
>                 URL: https://issues.apache.org/jira/browse/FLUME-2922
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>            Reporter: Kevin Conaway
>            Priority: Critical
>         Attachments: FLUME-2922.patch
>
>
> There is a possibility of losing data with the current HDFS sequence file 
> writer.
> Internally, the `SequenceFile.Writer` buffers data and periodically syncs it 
> to the underlying output stream.  The mechanism depends on whether 
> compression is enabled, but in both cases the key/value pairs are appended 
> to an internal buffer and only flushed to disk after the buffer reaches a 
> certain size.
> Thus it is quite possible for Flume to lose messages if the agent crashes, or 
> is stopped, before the internal buffer is flushed to disk.
> The correct action is to force the writer to sync its internal buffers to the 
> underlying `FSDataOutputStream` before calling hflush/sync.
> Additionally, I believe we should be calling hsync instead of hflush.  It's my 
> understanding that writes with hsync are more durable, which I believe is 
> the semantics we want here.
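
To illustrate the failure mode described above, here is a minimal sketch in plain Java. It does not use Hadoop or Flume classes: `BufferingWriter`, its `flushThreshold`, and `bytesReachingStream` are hypothetical names made up for this example, and the in-memory stream only stands in for the real output stream. It assumes a writer that holds records in an internal buffer until a size threshold is reached, which is the behavior the issue describes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BufferedWriterSketch {

    /** Hypothetical stand-in for a writer that buffers records internally. */
    static final class BufferingWriter {
        private final OutputStream out;
        private final int flushThreshold;
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        BufferingWriter(OutputStream out, int flushThreshold) {
            this.out = out;
            this.flushThreshold = flushThreshold;
        }

        /** Appends a record; it only reaches the stream once the buffer is full. */
        void append(String record) throws IOException {
            byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
            buffer.write(bytes, 0, bytes.length);
            if (buffer.size() >= flushThreshold) {
                sync();
            }
        }

        /** Forces all buffered bytes out to the underlying stream. */
        void sync() throws IOException {
            buffer.writeTo(out);
            buffer.reset();
        }
    }

    /** Returns how many bytes reached the underlying stream after two appends. */
    public static int bytesReachingStream(boolean forceSync) {
        try {
            ByteArrayOutputStream stream = new ByteArrayOutputStream();
            BufferingWriter writer = new BufferingWriter(stream, 1024);
            writer.append("event-1");
            writer.append("event-2");
            if (forceSync) {
                // Without this call, a crash here would lose both records.
                writer.sync();
            }
            return stream.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("without sync: " + bytesReachingStream(false));
        System.out.println("with sync:    " + bytesReachingStream(true));
    }
}
```

In this sketch, nothing reaches the underlying stream until `sync()` is called, so flushing the stream alone would not make the two records durable. The real writer and stream have more moving parts, so treat this only as a model of the ordering problem, not of the actual Hadoop implementation.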


