[
https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-9609:
----------------------------------
Labels: auto-unassigned pull-request-available stale-major (was:
auto-unassigned pull-request-available)
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issues has been marked as
Major but is unassigned and neither itself nor its Sub-Tasks have been updated
for 30 days. I have gone ahead and added a "stale-major" to the issue". If this
ticket is a Major, please either assign yourself or give an update. Afterwards,
please remove the label or in 7 days the issue will be deprioritized.
> Add bucket ready mechanism for BucketingSink when checkpoint complete
> ---------------------------------------------------------------------
>
> Key: FLINK-9609
> URL: https://issues.apache.org/jira/browse/FLINK-9609
> Project: Flink
> Issue Type: New Feature
> Components: Connectors / Common, Connectors / FileSystem
> Affects Versions: 1.4.2, 1.5.0
> Reporter: zhangminglei
> Priority: Major
> Labels: auto-unassigned, pull-request-available, stale-major
>
> Currently, BucketingSink only support {{notifyCheckpointComplete}}. However,
> users want to do some extra work when a bucket is ready. It would be nice if
> we can support {{BucketReady}} mechanism for users or we can tell users when
> a bucket is ready for use. For example, One bucket is created for every 5
> minutes, at the end of 5 minutes before creating the next bucket, the user
> might need to do something as the previous bucket ready, like sending the
> timestamp of the bucket ready time to a server or do some other stuff.
> Here, Bucket ready means all the part files suffix name under a bucket
> neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is
> ready for user use. Like a watermark means no elements with a timestamp older
> or equal to the watermark timestamp should arrive at the window. We can also
> refer to the concept of watermark here, or we can call this *BucketWatermark*
> if we could.
> Recently, I found a user who wants this functionality which I would think.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html
> Below is what he said:
> My user case is we read data from message queue, write to HDFS, and our ETL
> team will use the data in HDFS. *In the case, ETL need to know if all data is
> ready to be read accurately*, so we use a counter to count how many data has
> been wrote, if the counter is equal to the number we received, we think HDFS
> file is ready. We send the counter message in a custom sink so ETL can know
> how many data has been wrote, but if use current BucketingSink, even through
> HDFS file is flushed, ETL may still cannot read the data. If we can close
> file during checkpoint, then the result is accurately. And for the HDFS small
> file problem, it can be controller by use bigger checkpoint interval.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)