[
https://issues.apache.org/jira/browse/FLINK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853204#comment-17853204
]
EMERSON WANG commented on FLINK-35521:
--------------------------------------
We ran our PyFlink SQL jobs in both AWS EKS and EMR environments, and observed
from the Lambda log files that the _SUCCESS file was generated multiple times.
Using the same example as above, i.e., ts_min=00:
Once the following file associated with partition -0- was flushed out to
S3, the _SUCCESS file was generated:
s3://.../dt=2024-06-06/ts_hr=16/ts_min=00/part-c6f5badd-6a38-45c1-8351-06d31605bc1d-0-36608
......
Then, when a new file associated with the same partition -0- was flushed out to
S3, the _SUCCESS file was generated again:
s3://.../dt=2024-06-06/ts_hr=16/ts_min=00/part-c6f5badd-6a38-45c1-8351-06d31605bc1d-0-36609
......
Note that the suffixes of the above files, i.e., the last five digits, are
different. The same partition, i.e., -0-, may flush out multiple files,
depending on the partition data size, the max file size before rolling
(default 128 MB), and other configuration.
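For reference, a sink definition along these lines reproduces the setup described in this issue (the table name, columns, and path are hypothetical; only the connector options are taken from the report):

```sql
-- Hypothetical sink table sketch; partition columns match the dt/ts_hr/ts_min
-- layout seen in the S3 paths above.
CREATE TABLE s3_sink (
  id      STRING,
  payload STRING,
  dt      STRING,
  ts_hr   STRING,
  ts_min  STRING
) PARTITIONED BY (dt, ts_hr, ts_min) WITH (
  'connector' = 'filesystem',
  'path'      = 's3://.../output',  -- bucket/prefix elided as in the report
  'format'    = 'parquet',
  'sink.partition-commit.trigger'     = 'partition-time',
  'sink.partition-commit.policy.kind' = 'success-file',
  -- Default rolling size; per this report, each additional part file flushed
  -- into an already-committed partition regenerates the _SUCCESS file.
  'sink.rolling-policy.file-size' = '128MB'
);
```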
> Flink FileSystem SQL Connector Generating SUCCESS File Multiple Times
> ---------------------------------------------------------------------
>
> Key: FLINK-35521
> URL: https://issues.apache.org/jira/browse/FLINK-35521
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Affects Versions: 1.18.1
> Environment: Our PyFlink SQL jobs are running in AWS EKS environment.
> Reporter: EMERSON WANG
> Priority: Major
>
> Our Flink Table SQL job received data from Kafka streams and then sank
> all partitioned data into the associated Parquet files under the same S3
> folder through the FileSystem SQL connector.
> For the S3 FileSystem SQL connector, sink.partition-commit.policy.kind was
> set to 'success-file' and sink.partition-commit.trigger was set to
> 'partition-time'. We found that the _SUCCESS file in the S3 folder was
> generated multiple times as successive partitions were committed.
> Because all partitioned Parquet files and the _SUCCESS file are in the same
> S3 folder and the _SUCCESS file is used to trigger the downstream
> application, we would really like the _SUCCESS file to be generated only
> once, after all partitions are committed and all Parquet files are ready to
> be processed. That way, the single _SUCCESS file triggers the downstream
> application exactly once instead of multiple times.
> We knew we could set sink.partition-commit.trigger to 'process-time' to
> generate the _SUCCESS file only once in the S3 folder; however,
> 'process-time' would not meet our business requirements.
> We request that the FileSystem SQL connector support the following new
> use case:
> Even if sink.partition-commit.trigger is set to 'partition-time', the
> _SUCCESS file will be generated only once, after all partitions are
> committed and all output files are ready to be processed, and will trigger
> the downstream application only once instead of multiple times.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)