[
https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-9609:
----------------------------------
Labels: auto-deprioritized-major auto-unassigned pull-request-available
(was: auto-unassigned pull-request-available stale-major)
Priority: Minor (was: Major)
This issue was labeled "stale-major" 7 ago and has not received any updates so
it is being deprioritized. If this ticket is actually Major, please raise the
priority and ask a committer to assign you the issue or revive the public
discussion.
> Add bucket ready mechanism for BucketingSink when checkpoint complete
> ---------------------------------------------------------------------
>
> Key: FLINK-9609
> URL: https://issues.apache.org/jira/browse/FLINK-9609
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / Common, Connectors / FileSystem
> Affects Versions: 1.4.2, 1.5.0
> Reporter: zhangminglei
> Priority: Minor
> Labels: auto-deprioritized-major, auto-unassigned,
> pull-request-available
>
> Currently, BucketingSink only support {{notifyCheckpointComplete}}. However,
> users want to do some extra work when a bucket is ready. It would be nice if
> we can support {{BucketReady}} mechanism for users or we can tell users when
> a bucket is ready for use. For example, One bucket is created for every 5
> minutes, at the end of 5 minutes before creating the next bucket, the user
> might need to do something as the previous bucket ready, like sending the
> timestamp of the bucket ready time to a server or do some other stuff.
> Here, Bucket ready means all the part files suffix name under a bucket
> neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is
> ready for user use. Like a watermark means no elements with a timestamp older
> or equal to the watermark timestamp should arrive at the window. We can also
> refer to the concept of watermark here, or we can call this *BucketWatermark*
> if we could.
> Recently, I found a user who wants this functionality which I would think.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html
> Below is what he said:
> My user case is we read data from message queue, write to HDFS, and our ETL
> team will use the data in HDFS. *In the case, ETL need to know if all data is
> ready to be read accurately*, so we use a counter to count how many data has
> been wrote, if the counter is equal to the number we received, we think HDFS
> file is ready. We send the counter message in a custom sink so ETL can know
> how many data has been wrote, but if use current BucketingSink, even through
> HDFS file is flushed, ETL may still cannot read the data. If we can close
> file during checkpoint, then the result is accurately. And for the HDFS small
> file problem, it can be controller by use bigger checkpoint interval.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)