[
https://issues.apache.org/jira/browse/FLINK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825684#comment-16825684
]
shin chen commented on FLINK-12320:
-----------------------------------
Hi.
When I use SequenceFile.CompressionType.RECORD, everything works fine.
But after changing the compression type to BLOCK, there is a small
data loss (about 100 - 2000 records per topic) every time I restart the job.
It seems this is caused by the way Flink takes a savepoint.
The valid length records the length of the in-progress file at the moment the
job is cancelled with a savepoint. However, with CompressionType.BLOCK the
records are buffered for compression and are not written to HDFS immediately,
so the offsets committed by Flink and the valid length saved in the savepoint
are an inconsistent pair. This causes a small data loss, because Flink
truncates the old file to the recorded valid length when the job is restarted
from the savepoint.
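For reference, here is roughly how the sink is set up on my side. This is only
a minimal sketch; the key/value types, codec name, output path, and helper
class name are example values, not my exact job configuration.
{code:java}
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.SequenceFileWriter;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BlockCompressedSinkSketch {

    // Attaches a BucketingSink that writes block-compressed SequenceFiles.
    public static void attachSink(DataStream<Tuple2<LongWritable, Text>> stream,
                                  String basePath) {
        BucketingSink<Tuple2<LongWritable, Text>> sink = new BucketingSink<>(basePath);

        // With CompressionType.RECORD every record is flushed individually, so the
        // valid length written into the savepoint matches the bytes on HDFS.
        // With CompressionType.BLOCK, records are buffered until a compressed block
        // is emitted, so the in-progress file on HDFS can be shorter than the
        // offsets checkpointed at the same time; that is the inconsistent pair
        // described above.
        sink.setWriter(new SequenceFileWriter<LongWritable, Text>(
                "DefaultCodec", SequenceFile.CompressionType.BLOCK));

        stream.addSink(sink);
    }
}
{code}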
I don't know whether this issue is fixed in StreamingFileSink.
If this only affects the BucketingSink, or I am missing something, please tell
me.
Thanks!
> Flink bucketing sink restart with savepoint causes data loss when using
> SequenceFile.CompressionType.BLOCK
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-12320
> URL: https://issues.apache.org/jira/browse/FLINK-12320
> Project: Flink
> Issue Type: Bug
> Reporter: shin chen
> Priority: Minor
>
> When using SequenceFile.CompressionType.BLOCK as the writer of the bucketing
> sink, there is some data loss even when restarting from a savepoint.