[ https://issues.apache.org/jira/browse/FLINK-35521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853204#comment-17853204 ]

EMERSON WANG commented on FLINK-35521:
--------------------------------------

We ran our PyFlink SQL jobs in both AWS EKS and EMR environments, and observed
from the Lambda log files that the _SUCCESS file was generated multiple times.
Using the same example as above, i.e., ts_min=00:

Once the following file, associated with the partition -0-, was flushed out to
S3, the _SUCCESS file was generated:
s3://.../dt=2024-06-06/ts_hr=16/ts_min=00/part-c6f5badd-6a38-45c1-8351-06d31605bc1d-0-36608
......

Then, when a new file associated with the same partition -0- was flushed out to
S3, the _SUCCESS file was generated again:
s3://.../dt=2024-06-06/ts_hr=16/ts_min=00/part-c6f5badd-6a38-45c1-8351-06d31605bc1d-0-36609
......

Note that the suffixes of the two files above, i.e., the last five digits, are
different. The same partition, i.e., -0-, may flush out multiple files,
depending on the partition's data size, the maximum file size before rolling
(default 128 MB), and other configuration.
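
For context, how many part files one writer flushes per partition is governed
by the filesystem connector's sink.rolling-policy.* options (plus the
checkpoint interval for bulk formats such as Parquet). Below is a minimal
PyFlink sketch with illustrative values only; the table name, schema, and S3
path are assumptions, not taken from our job:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Minimal sketch: the sink.rolling-policy.* options control how many part files
# one writer flushes per partition. All identifiers and values are illustrative.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE s3_sink (
        msg    STRING,
        dt     STRING,
        ts_hr  STRING,
        ts_min STRING
    ) PARTITIONED BY (dt, ts_hr, ts_min) WITH (
        'connector' = 'filesystem',
        'path'      = 's3://bucket/prefix/',        -- placeholder path
        'format'    = 'parquet',
        'sink.rolling-policy.file-size' = '128MB',  -- default max size before rolling a new part file
        'sink.rolling-policy.rollover-interval' = '30 min',
        'sink.rolling-policy.check-interval' = '1 min'
    )
""")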

> Flink FileSystem SQL Connector Generating SUCCESS File Multiple Times
> ---------------------------------------------------------------------
>
>                 Key: FLINK-35521
>                 URL: https://issues.apache.org/jira/browse/FLINK-35521
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>    Affects Versions: 1.18.1
>         Environment: Our PyFlink SQL jobs are running in AWS EKS environment.
>            Reporter: EMERSON WANG
>            Priority: Major
>
> Our Flink Table SQL job received data from Kafka streams and wrote all
> partitioned data to the associated Parquet files under the same S3 folder
> through the filesystem SQL connector.
> For the S3 filesystem SQL connector, sink.partition-commit.policy.kind was
> set to 'success-file' and sink.partition-commit.trigger was set to
> 'partition-time' (see the configuration sketch below). We found that the
> _SUCCESS file in the S3 folder was generated multiple times as successive
> partitions were committed.
> Because all partitioned Parquet files and the _SUCCESS file are in the same
> S3 folder, and the _SUCCESS file is used to trigger the downstream
> application, we would really like the _SUCCESS file to be generated only
> once, after all partitions are committed and all Parquet files are ready to
> be processed. That way, a single _SUCCESS file triggers the downstream
> application exactly once instead of multiple times.
> We knew we could set sink.partition-commit.trigger to 'process-time' to
> generate the _SUCCESS file only once in the S3 folder; however,
> 'process-time' would not meet our business requirements.
> We'd like to request that the FileSystem SQL connector support the following
> new use case:
> Even if sink.partition-commit.trigger is set to 'partition-time', the
> _SUCCESS file should be generated only once, after all partitions are
> committed and all output files are ready to be processed, so that it
> triggers the downstream application only once instead of multiple times.
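
For reference, here is a minimal PyFlink sketch of the partition-commit
configuration described in the issue. Only the sink.partition-commit.trigger
and sink.partition-commit.policy.kind values come from the report above; the
table name, schema, S3 path, timestamp pattern, and commit delay are
illustrative assumptions:

from pyflink.table import EnvironmentSettings, TableEnvironment

# Sketch of the partition-commit settings named in the issue; everything except
# the trigger and policy.kind values is an illustrative assumption.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE s3_sink (
        msg    STRING,
        dt     STRING,
        ts_hr  STRING,
        ts_min STRING
    ) PARTITIONED BY (dt, ts_hr, ts_min) WITH (
        'connector' = 'filesystem',
        'path'      = 's3://bucket/prefix/',                  -- placeholder path
        'format'    = 'parquet',
        'partition.time-extractor.timestamp-pattern' = '$dt $ts_hr:$ts_min:00',
        'sink.partition-commit.trigger' = 'partition-time',   -- commit driven by watermarks
        'sink.partition-commit.delay' = '1 min',              -- illustrative delay
        'sink.partition-commit.policy.kind' = 'success-file'  -- writes _SUCCESS on every partition commit
    )
""")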



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
