[jira] [Updated] (FLINK-9609) Add bucket ready mechanism for BucketingSink when checkpoint complete

Flink Jira Bot (Jira) Mon, 21 Jun 2021 03:42:12 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Flink Jira Bot updated FLINK-9609:
----------------------------------
      Labels: auto-deprioritized-major auto-unassigned pull-request-available  
(was: auto-unassigned pull-request-available stale-major)
    Priority: Minor  (was: Major)

This issue was labeled "stale-major" 7 ago and has not received any updates so 
it is being deprioritized. If this ticket is actually Major, please raise the 
priority and ask a committer to assign you the issue or revive the public 
discussion.


> Add bucket ready mechanism for BucketingSink when checkpoint complete
> ---------------------------------------------------------------------
>
>                 Key: FLINK-9609
>                 URL: https://issues.apache.org/jira/browse/FLINK-9609
>             Project: Flink
>          Issue Type: New Feature
>          Components: Connectors / Common, Connectors / FileSystem
>    Affects Versions: 1.4.2, 1.5.0
>            Reporter: zhangminglei
>            Priority: Minor
>              Labels: auto-deprioritized-major, auto-unassigned, 
> pull-request-available
>
> Currently, BucketingSink only support {{notifyCheckpointComplete}}. However, 
> users want to do some extra work when a bucket is ready. It would be nice if 
> we can support {{BucketReady}} mechanism for users or we can tell users when 
> a bucket is ready for use. For example, One bucket is created for every 5 
> minutes, at the end of 5 minutes before creating the next bucket, the user 
> might need to do something as the previous bucket ready, like sending the 
> timestamp of the bucket ready time to a server or do some other stuff.
> Here, Bucket ready means all the part files suffix name under a bucket 
> neither {{.pending}} nor {{.in-progress}}. Then we can think this bucket is 
> ready for user use. Like a watermark means no elements with a timestamp older 
> or equal to the watermark timestamp should arrive at the window. We can also 
> refer to the concept of watermark here, or we can call this *BucketWatermark* 
> if we could.
> Recently, I found a user who wants this functionality which I would think.
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Let-BucketingSink-roll-file-on-each-checkpoint-td19034.html
> Below is what he said:
> My user case is we read data from message queue, write to HDFS, and our ETL 
> team will use the data in HDFS. *In the case, ETL need to know if all data is 
> ready to be read accurately*, so we use a counter to count how many data has 
> been wrote, if the counter is equal to the number we received, we think HDFS 
> file is ready. We send the counter message in a custom sink so ETL can know 
> how many data has been wrote, but if use current BucketingSink, even through 
> HDFS file is flushed, ETL may still cannot read the data. If we can close 
> file during checkpoint, then the result is accurately. And for the HDFS small 
> file problem, it can be controller by use bigger checkpoint interval. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-9609) Add bucket ready mechanism for BucketingSink when checkpoint complete

Reply via email to