KevinyhZou created HUDI-5159:
--------------------------------
Summary: Support write a success file to partition when it
finished in flink streaming append writer
Key: HUDI-5159
URL: https://issues.apache.org/jira/browse/HUDI-5159
Project: Apache Hudi
Issue Type: New Feature
Components: connectors
Reporter: KevinyhZou
When we use flink streaming job to consume data from mq and wtite to hudi
partition table, we can not know when a partition is wite finished. And this is
often necessary to tell the downstream offline task scheduler to run while
partition is finished.
I think we can use the flink watermark mechanism to implment this. As watermark
represents the minimum timestamp in flink streaming job, when the watermark is
greater than the hudi partition time, it always means the data is write
finished to hudi parititon in an ordered streaming data, and then it is the
time to write a success file to the parititon path to represent it finished
wirte.
It can be designed as below.
# Get the field of partitions and values in flink append streaming data, this
can be implements in AppendWriteFunction, the emit it to downstream;
# Implement a SuccessFileWriteSink to receive these partition values and store
them to activePartitions set,
# Compare the watermark timestamp and the partition timestamp values converted
from activeParitions set, if the wartermark is greater, set the partition to
finished partitons set;
# Iterate the finished partition set, and get the partition path, and write
success file to it while flink job make checkpoint;
# Store the active partition set and finished partition set in flink state,
avoid the data loss while the job failver.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)