Jakub Nowacki created FLINK-6272:
------------------------------------
Summary: Rolling file sink saves incomplete lines on failure
Key: FLINK-6272
URL: https://issues.apache.org/jira/browse/FLINK-6272
Project: Flink
Issue Type: Bug
Components: filesystem-connector, Streaming Connectors
Affects Versions: 1.2.0
Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie),
CDH 5.8, YARN
Reporter: Jakub Nowacki
We have a simple pipeline with a Kafka source (0.9), which transforms data and
writes to a Rolling File Sink, running on YARN. The sink is a plain HDFS sink
with a StringWriter, configured as follows:
{code:scala}
import org.apache.flink.streaming.connectors.fs.StringWriter
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}

val fileSink = new BucketingSink[String]("some_path")
fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
fileSink.setWriter(new StringWriter())
fileSink.setBatchSize(1024 * 1024 * 1024) // this is 1 GB
{code}
Checkpointing is on. Both the Kafka source and the file sink should in theory
provide an [exactly-once
guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
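For reference, checkpointing is enabled in the job roughly as follows (the interval shown is illustrative, not our exact setting):

{code:scala}
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Illustrative interval only; the job enables checkpointing along these lines.
env.enableCheckpointing(60000) // checkpoint every 60 s
{code}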
On failure, some files which appear complete (not {{in_progress}} files or
something, but under the 1 GB batch size and confirmed to have been created
around the failure) turn out to have the last line cut off. In our case this is
visible because we save the data as line-by-line JSON, so the truncation
produces an invalid JSON line. It does not happen on every failure, but I have
noticed at least 3 incidents like that.
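The truncation is easy to spot mechanically. A minimal sketch of the check, assuming (as in our pipeline) one JSON object per line, so a complete final line should end with a closing brace:

{code:scala}
// Heuristic completeness check for a line-by-line JSON file.
// Assumes each record is a single JSON object on its own line;
// a truncated final line will not end with '}'.
def lastLineComplete(lines: Seq[String]): Boolean =
  lines.isEmpty || lines.last.trim.endsWith("}")
{code}

Run against e.g. {{scala.io.Source.fromFile("part-0-0").getLines().toSeq}} for each closed part file (the file name here is hypothetical).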
Also, I am not sure if it is a separate bug, but in this case we also see some
data duplication coming from Kafka. I.e. after the pipeline is restarted, a
number of messages come out of the Kafka source which had already been saved in
the previous file. We can tell the messages are duplicates because they carry
the same data but a different timestamp, which is added within the Flink
pipeline. In theory this should not happen, as both the sink and the source
provide an [exactly-once
guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
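A sketch of how the duplicates can be confirmed from the output files (the record shape is an assumption: each record reduced to its payload plus the pipeline-added timestamp). A payload seen with more than one timestamp indicates a re-emitted message, since the timestamp is assigned inside the pipeline:

{code:scala}
// records: (payload, timestamp) pairs parsed from the saved files.
// Returns payloads that occur with multiple timestamps, i.e. duplicates.
def findDuplicates(records: Seq[(String, Long)]): Map[String, Seq[Long]] =
  records.groupBy(_._1)
    .mapValues(_.map(_._2))
    .filter(_._2.size > 1)
    .toMap
{code}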
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)