[
https://issues.apache.org/jira/browse/FLINK-6272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-6272:
----------------------------------
Labels: auto-deprioritized-major auto-deprioritized-minor (was:
auto-deprioritized-major stale-minor)
Priority: Not a Priority (was: Minor)
This issue was labeled "stale-minor" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Minor, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Rolling file sink saves incomplete lines on failure
> ---------------------------------------------------
>
> Key: FLINK-6272
> URL: https://issues.apache.org/jira/browse/FLINK-6272
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Common, Connectors / FileSystem
> Affects Versions: 1.2.0
> Environment: Flink 1.2.0, Scala 2.11, Debian GNU/Linux 8.7 (jessie),
> CDH 5.8, YARN
> Reporter: Jakub Nowacki
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> We have a simple pipeline with a Kafka source (0.9) that transforms data and
> writes to a rolling file sink; it runs on YARN. The sink is a plain HDFS
> BucketingSink with a StringWriter, configured as follows:
> {code:java}
> val fileSink = new BucketingSink[String]("some_path")
> fileSink.setBucketer(new DateTimeBucketer[String]("yyyy-MM-dd"))
> fileSink.setWriter(new StringWriter())
> fileSink.setBatchSize(1024 * 1024 * 1024) // this is 1 GB
> {code}
> Checkpointing is enabled. In theory, both the Kafka source and the file sink
> provide an [exactly-once
> guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
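> For reference, checkpointing in such a job would be enabled roughly like this
> (a hedged sketch, not the reporter's actual code; the 60-second interval is
> illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Exactly-once delivery with BucketingSink relies on checkpointing
        // being enabled; the interval below is an illustrative value.
        env.enableCheckpointing(60_000L); // checkpoint every 60 s
    }
}
```

> The BucketingSink truncates or marks files back to the last checkpointed
> valid length on restore, so the checkpoint interval bounds how much data is
> replayed after a failure.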
> On failure, some files that appear to be complete (not {{in_progress}} files
> or the like, but under 1 GB and confirmed to have been created before the
> failure) turn out to have their last line cut off. In our case this is
> visible because we save the data as line-by-line JSON, so the cut produces an
> invalid JSON line. This does not happen on every failure, but I have noticed
> at least three incidents like this.
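> The cut-off last line is consistent with data sitting in an in-memory writer
> buffer when the process dies. A minimal, self-contained sketch of that
> mechanism (plain Java I/O, not Flink code; all names are illustrative):

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class Main {
    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("sink", ".jsonl");
        // A buffered writer, like a sink's stream writer: bytes stay in
        // memory until the buffer fills or flush() is called.
        BufferedWriter w = new BufferedWriter(new FileWriter(path.toFile()));
        w.write("{\"event\": \"a\"}\n");
        w.write("{\"event\": \"b\"}\n");

        // Before any flush, nothing has reached the file: a crash at this
        // point loses the buffered tail, possibly mid-line.
        String before = Files.readString(path);

        w.flush(); // a flush at checkpoint time would make the lines durable
        String after = Files.readString(path);
        w.close();
        Files.delete(path);

        System.out.println(before.isEmpty());           // true
        System.out.println(after.endsWith("\"b\"}\n")); // true
    }
}
```

> Whether the cut happens exactly at a line boundary depends on where the
> buffer boundary falls, which matches the invalid-JSON-line symptom above.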
> Also, I am not sure if this is a separate bug, but in this case we also see
> some data duplication coming from Kafka: after the pipeline is restarted, a
> number of messages come out of the Kafka source that had already been saved
> in the previous file. We can confirm the messages are duplicates because they
> have the same data but different timestamps (the timestamp is added within
> the Flink pipeline). In theory this should not happen, as both the source and
> the sink provide an [exactly-once
> guarantee|https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/connectors/guarantees.html].
--
This message was sent by Atlassian Jira
(v8.20.1#820001)