[ 
https://issues.apache.org/jira/browse/FLINK-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flink Jira Bot updated FLINK-6272:
----------------------------------
    Labels: auto-deprioritized-major  (was: stale-major)

> Rolling file sink saves incomplete lines on failure
> ---------------------------------------------------
>
>                 Key: FLINK-6272
>                 URL: https://issues.apache.org/jira/browse/FLINK-6272
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common, Connectors / FileSystem
>    Affects Versions: 1.2.0
>         Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), 
> CDH 5.8, YARN
>            Reporter: Jakub Nowacki
>            Priority: Major
>              Labels: auto-deprioritized-major
>
> We have a simple pipeline with a Kafka source (0.9), which transforms data
> and writes to a rolling file sink; the job runs on YARN. The sink is a plain
> HDFS sink with a {{StringWriter}}, configured as follows:
> {code:java}
> import org.apache.flink.streaming.connectors.fs.StringWriter
> import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
>
> val fileSink = new BucketingSink[String]("some_path")
> fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
> fileSink.setWriter(new StringWriter())
> fileSink.setBatchSize(1024 * 1024 * 1024) // 1 GB per part file
> {code}
> Checkpointing is enabled. Both the Kafka source and the file sink should, in
> theory, provide an [exactly-once
> guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
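> For reference, checkpointing is enabled on the environment in the usual way
> (a minimal sketch; the interval below is illustrative, not our actual value):
> {code:java}
> import org.apache.flink.streaming.api.CheckpointingMode
> import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
>
> val env = StreamExecutionEnvironment.getExecutionEnvironment
> // Checkpoint every 60 s; EXACTLY_ONCE is also the default mode.
> env.enableCheckpointing(60000)
> env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
> {code}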
> After a failure, some files that appear to be complete (not {{in_progress}}
> files or the like, but under 1 GB and confirmed to have been created at the
> time of the failure) turn out to have their last line cut off. In our case
> this shows up because we save the data as line-by-line JSON, so a truncated
> line is an invalid JSON record. It does not happen on every failure, but I
> have noticed at least three incidents like this.
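> A heuristic sketch of how such files can be spotted (the path and the check
> itself are illustrative only):
> {code:java}
> import scala.io.Source
>
> // For line-delimited JSON, a complete record should both start with
> // '{' and end with '}'; a cut-off last line fails this check.
> val lines = Source.fromFile("part-0-0").getLines().toList
> lines.lastOption.foreach { last =>
>   if (!(last.startsWith("{") && last.endsWith("}")))
>     println(s"truncated last line: $last")
> }
> {code}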
> Also, I am not sure whether it is a separate bug, but in this case we also
> see some data duplication coming from Kafka, i.e. after the pipeline is
> restarted, a number of messages come out of the Kafka source that had
> already been saved in the previous file. We can tell the messages are
> duplicates because they carry the same data but different timestamps, which
> are added within the Flink pipeline. In theory this should not happen, as
> the source and sink have an [exactly-once
> guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
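> As a sketch, the duplicates can be identified roughly like this (the field
> names are illustrative, not our actual schema):
> {code:java}
> // Duplicates share the payload but carry different timestamps that
> // were attached inside the Flink pipeline.
> case class Record(payload: String, timestamp: Long)
>
> def findDuplicates(records: Seq[Record]): Map[String, Seq[Record]] =
>   records.groupBy(_.payload).filter { case (_, rs) =>
>     rs.map(_.timestamp).distinct.size > 1
>   }
> {code}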



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
