[ 
https://issues.apache.org/jira/browse/FLINK-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825684#comment-16825684
 ] 

shin chen edited comment on FLINK-12320 at 4/25/19 3:40 AM:
------------------------------------------------------------

Hi [~kisimple],

When I use SequenceFile.CompressionType.RECORD, everything works fine.

But after changing the compressionType to BLOCK, there is a small data loss 
(about 100 - 2000 records per topic) every time I restart the job.

It seems this is due to the way Flink takes a savepoint. 

The valid-length file records the length of the in-progress file at the moment 
the job is cancelled with a savepoint. However, with CompressionType.BLOCK, 
records are not flushed to HDFS immediately, because they are buffered until a 
full block can be compressed. As a result, the commit offset recorded by Flink 
and the valid length saved in the savepoint form an inconsistent pair. This 
causes a small data loss, since Flink truncates the old file to the recorded 
valid length when restarting from the savepoint.
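The mismatch described above can be reproduced without Flink or Hadoop at all. The sketch below is a minimal, hypothetical simulation (the class name BlockBufferedWriter and all sizes are made up for illustration, not Flink or Hadoop API): records are buffered like BLOCK compression does, the "savepoint" captures the flushed file length as the valid length, and truncating to that length drops the records still sitting in the compression buffer even though the commit offset already covers them.

```python
class BlockBufferedWriter:
    """Buffers records and flushes to the 'file' only when a block fills,
    mimicking SequenceFile.CompressionType.BLOCK (illustrative only)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.file = bytearray()    # bytes durably written to "HDFS"
        self.buffer = bytearray()  # bytes still held for block compression
        self.records_written = 0   # records the source considers committed

    def write(self, record: bytes):
        self.buffer += record
        self.records_written += 1
        if len(self.buffer) >= self.block_size:
            self.file += self.buffer  # a full block is flushed to HDFS
            self.buffer.clear()

writer = BlockBufferedWriter(block_size=100)
for _ in range(25):
    writer.write(b"0123456789")  # 10 bytes per record

# "Cancel with savepoint": the current file length becomes the valid
# length, while the commit offset says all 25 records were processed.
valid_length = len(writer.file)
committed_records = writer.records_written

# On restart the file is truncated to valid_length, so only records in
# fully flushed blocks survive; the buffered tail is lost.
recoverable_records = valid_length // 10

assert committed_records == 25
assert recoverable_records == 20  # the 5 buffered records are lost
```

With RECORD compression every write reaches the file before the savepoint is taken, so valid length and commit offset stay consistent; BLOCK compression breaks that invariant exactly as the simulation shows.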

I don't know whether this issue is fixed in the StreamingFileSink. 

If this only affects the BucketingSink, or I am wrong about something, please 
tell me. 

Thanks!



> Flink bucketing sink restart with save point cause data loss when using 
> SequenceFile.CompressionType.BLOCK
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-12320
>                 URL: https://issues.apache.org/jira/browse/FLINK-12320
>             Project: Flink
>          Issue Type: Bug
>            Reporter: shin chen
>            Priority: Minor
>
> When using SequenceFile.CompressionType.BLOCK as the writer of the bucketing 
> sink, there is some data loss even when restarting with a savepoint.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
