Jakub Nowacki created FLINK-6272:
------------------------------------

             Summary: Rolling file sink saves incomplete lines on failure
                 Key: FLINK-6272
                 URL: https://issues.apache.org/jira/browse/FLINK-6272
             Project: Flink
          Issue Type: Bug
          Components: filesystem-connector, Streaming Connectors
    Affects Versions: 1.2.0
         Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie), 
CDH 5.8, YARN
            Reporter: Jakub Nowacki


We have simple pipeline with Kafka source (0.9), which transforms data and 
writes to Rolling File Sink, which runs on YARN. The sink is a plain HDFS sink 
with StringWriter configured as follows:
{code:java}
val fileSink = new BucketingSink[String]("some_path")
        fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
        fileSink.setWriter(new StringWriter())
        fileSink.setBatchSize(1024 * 1024 * 1024) // this is 1 GB
{code}
Checkpoint is on. Both Kafka source and File sink are in theory with 
[exactly-once 
guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].

On failure in some files, which seem to be complete (not {{in_progress}} files 
ore something, but under 1 GB and confirmed to be created on failure), it comes 
out that the last line is cut. In our case it shows because we save the data in 
line-by-line JSON and this creates invalid JSON line. This does not happen 
always when the  but I noticed at least 3 incidents like that at least.

Also, I am not sure if it is a separate bug but we see some data duplication in 
this case coming from Kafka. I.e.after the pipeline is restarted some number of 
messages come out from Kafka source, which already have been saved in the 
previous file. We can check that the messages are duplicated as they have same 
data but different timestamp, which is added within Flink pipeline. This should 
not happen in theory as the sink and source have [exactly-once 
guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to