[ https://issues.apache.org/jira/browse/FLINK-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Flink Jira Bot updated FLINK-6272:
----------------------------------
    Labels: auto-deprioritized-major  (was: stale-major)

> Rolling file sink saves incomplete lines on failure
> ---------------------------------------------------
>
>                 Key: FLINK-6272
>                 URL: https://issues.apache.org/jira/browse/FLINK-6272
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / Common, Connectors / FileSystem
>    Affects Versions: 1.2.0
>         Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), CDH 5.8, YARN
>            Reporter: Jakub Nowacki
>            Priority: Major
>              Labels: auto-deprioritized-major
>
> We have a simple pipeline with a Kafka source (0.9) which transforms data and
> writes to a rolling file sink; the job runs on YARN. The sink is a plain HDFS
> sink with a StringWriter, configured as follows:
> {code:java}
> val fileSink = new BucketingSink[String]("some_path")
> fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
> fileSink.setWriter(new StringWriter())
> fileSink.setBatchSize(1024 * 1024 * 1024) // this is 1 GB
> {code}
> Checkpointing is enabled. Both the Kafka source and the file sink in theory provide an
> [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
> After a failure, in some files which appear to be complete (not {{in_progress}}
> files or the like, but under 1 GB and confirmed to have been created before the failure),
> the last line turns out to be cut off. In our case this is visible because we write
> the data as line-by-line JSON, so the truncation produces an invalid JSON line.
> This does not happen on every failure, but I have noticed at least 3 incidents like this.
> Also, I am not sure if it is a separate bug, but in this case we also see some data
> duplication coming from Kafka, i.e. after the pipeline is restarted, a number of
> messages come out of the Kafka source which have already been saved in the previous file.
> We can verify that the messages are duplicates because they carry the same data but
> different timestamps, the timestamp being added within the Flink pipeline. In theory
> this should not happen, as both the sink and the source have an
> [exactly-once guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
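For context on the exactly-once claim in the report above: the BucketingSink only finalizes pending part files when a checkpoint completes, so its guarantee depends on checkpointing being enabled on the execution environment. A minimal sketch using the Flink 1.2-era streaming API (the 60-second interval is illustrative, not taken from the report):

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Without checkpointing the BucketingSink cannot finalize files
// transactionally, and no exactly-once guarantee applies.
env.enableCheckpointing(60 * 1000) // checkpoint every 60 s (illustrative)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
```

EXACTLY_ONCE is the default checkpointing mode; it is set explicitly here only to make the intent visible.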
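The duplication check described above (same payload, different pipeline-assigned timestamp) can be sketched as a small offline helper. The `Record` shape and the function name are hypothetical, introduced only for illustration:

```scala
// Hypothetical record shape: the message payload plus the timestamp
// that the Flink pipeline attaches on ingestion.
case class Record(payload: String, timestamp: Long)

/** Group saved records by payload and flag any payload that occurs
  * more than once; duplicates differ only in the pipeline-assigned
  * timestamp. Returns payload -> occurrence count for duplicates. */
def findDuplicates(records: Seq[Record]): Map[String, Int] =
  records.groupBy(_.payload).collect {
    case (payload, rs) if rs.size > 1 => payload -> rs.size
  }
```

Running this over the records of consecutive output files would surface the messages that were replayed from Kafka after the restart.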