[ 
https://issues.apache.org/jira/browse/HUDI-5095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Forward Xu updated HUDI-5095:
-----------------------------
    Description: 
In cases where we need a flag to measure the progress of data writing, I think
storing the watermark as an attribute of the Hudi commit metadata is a
reasonable approach.

One of our scenarios is that Flink writes data to a Hudi table in real time, and
then we use this Hudi table to support batch computation, so we need a flag to
evaluate whether a given partition's data is complete.

For example, job1 is scheduled every hour. At 2022-01-19 02:01:00, job1 starts
to check whether the partition (20220119/01) of hudi_table1 is complete (Flink
writes data to hudi_table1 in real time). When the watermark property in
hudi_table1's commit metadata is later than 2022-01-19 02:05:00 (allowing 5
minutes for out-of-order data), we consider partition (20220119/01) complete and
can safely run Hive or Flink SQL for batch computation (basically insert
table2 select xx from hudi_table1...).
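
The completeness check in this example can be sketched roughly as follows. This
is only an illustration of the comparison, not Hudi or Flink code: the function
names partition_end and partition_is_complete are hypothetical, the partition
format "yyyyMMdd/HH" and the 5-minute allowance are taken from the example
above, and the watermark is assumed to have already been read from the latest
commit metadata by some other means.

```python
from datetime import datetime, timedelta


def partition_end(partition: str) -> datetime:
    """Return the exclusive end time of an hourly partition such as '20220119/01'."""
    start = datetime.strptime(partition, "%Y%m%d/%H")
    return start + timedelta(hours=1)


def partition_is_complete(
    watermark: datetime,
    partition: str,
    allowed_lateness: timedelta = timedelta(minutes=5),
) -> bool:
    """Hypothetical check: the partition counts as complete once the watermark
    is strictly later than the partition end plus the out-of-order allowance."""
    return watermark > partition_end(partition) + allowed_lateness


# Watermark still at 02:01:00: the 02:05:00 threshold has not been passed yet.
print(partition_is_complete(datetime(2022, 1, 19, 2, 1, 0), "20220119/01"))
# Watermark at 02:05:30: the threshold has been passed.
print(partition_is_complete(datetime(2022, 1, 19, 2, 5, 30), "20220119/01"))
```

With a check like this, job1 would start the batch SQL only once the function
returns True for the partition it is waiting on.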

!image-2022-10-26-16-37-07-343.png!

  was:
In some cases where we need a flag to measure the progress of data writing, I 
think it is a reasonable way to store the watermark as an attribute of the hudi 
commit metadata.

One of our scenarios is that Flink writes data to Hudi table in real time, and 
then we use this Hudi table to support batch computation, so we need a flag to 
evaluate whether its partition data is complete.

For example, job1 is scheduled every hour. At 2022-01-19 02:01:00, job1 starts 
to check whether the partition (20220119/01) of hudi_table1 is completed (Flink 
writes data to hudi_table1 in real time). When the watermark properties of 
hudi_table1's commit metadata are higher than 2022-01-19 02:05:00 Update (5 
minutes out of order), we consider partition(20220119/01) as completed and we 
can safely execute Hive or Flink sql for batch computation. (basically insert 
table2 select xx from hudi_table1...)

 


> Flink: Stores a special watermark(flag) to identify the current progress of 
> writing data
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-5095
>                 URL: https://issues.apache.org/jira/browse/HUDI-5095
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: flink, flink-sql
>            Reporter: Forward Xu
>            Assignee: Forward Xu
>            Priority: Major
>         Attachments: image-2022-10-26-16-37-07-343.png
>



