Copilot commented on code in PR #17818:
URL: https://github.com/apache/hudi/pull/17818#discussion_r2677145145

##########
website/blog/2026-01-09-hudi-11-deep-dive-flink-async-instant-gen.md:
##########
@@ -0,0 +1,147 @@
+---
+title: "Apache Hudi 1.1 Deep Dive: Async Instant Time Generation for Flink Writers"
+excerpt: "Explore how Hudi 1.1 introduces asynchronous instant generation for Flink writers to eliminate throughput fluctuations in streaming ingestion."
+authors: [shuo-cheng]
+category: blog
+image: /assets/images/blog/2026-01-09-hudi-11-deep-dive-flink-async-instant-gen/async-instant-gen.png
+tags:
+ - hudi
+ - flink
+ - streaming
+---
+
+---
+
+_This blog was translated from the [original blog in Chinese](https://mp.weixin.qq.com/s/3r06DFdkaiGkF1_7NiovZw)._
+
+---
+
+## Background
+
+Before the Hudi 1.1 release, in order to guarantee the exactly-once semantics of the Hudi Flink sink, a new instant could only be generated after the previous instant was successfully committed to Hudi. During this period, Flink writers had to block and wait. Starting from Hudi 1.1, we introduce a new asynchronous instant generation mechanism for Flink writers. This approach allows writers to request the next instant even before the previous one has been committed successfully. At the same time, it still ensures the ordering and consistency of multi-transaction commits. In the following sections, we will first briefly introduce some of Hudi's basic concepts, and then dive into the details of asynchronous instant time generation.
+
+## Instant Time
+
+Timeline is a core component of Hudi's architecture. It serves as the single source of truth for the table's state, recording all operations performed on a table. Each operation is identified by a commit with a monotonically increasing instant time, which indicates the start time of each transaction.
+
+
+
+Hudi provides the following capabilities based on instant time:
+
+- **More efficient write rollbacks**: Each Hudi commit corresponds to an instant time. The instant timestamp can be used to quickly locate files affected by failed writes.
+- **File name-based file slicing**: Since instant time is encoded into file names, Hudi can efficiently perform file slicing across different versions of files within a table.
+- **Incremental queries**: Each row in a Hudi table carries a `_hoodie_commit_time` metadata field. This allows incremental queries at any point in the timeline, even when full compaction or cleaning services are running asynchronously.
+
+## Completion Time
+
+### File Slicing Based on Instant Time
+
+Before release 1.0, Hudi organized data files in units called `FileGroup`. Each file group contains multiple `FileSlice`s. Each file slice contains one base file and multiple log files. Every compaction on a file group generates a new file slice. The timestamp in the base file name corresponds to the instant time of the compaction operation that wrote the file. The timestamp in the log file name is the same as the base instant time of the current file slice. Data files with the same instant time belong to the same file slice.
+
+
+
+In concurrent write scenarios, the instant time naming convention of log files introduces certain limitations: as asynchronous compaction progresses, the base instant time can change. To ensure that writers can correctly determine the base instant time, the ordering between write commits and compaction scheduling must be enforced—meaning that compaction can only be scheduled when there are no ongoing write operations on the table. Otherwise, a log file might be written with an incorrect base instant time, potentially leading to data loss. As a result, compaction scheduling may block all writers in concurrency mode.
+
+### File Slicing Based on Completion Time
+
+To address these issues, starting from version 1.0, Hudi introduced a new file slicing model based on a time interval defined by requested time and completion time. In release 1.x, each commit has two important time concepts: requested time and completion time. All generated timestamps are globally monotonically increasing. The timestamp in the log file name is no longer the base instant time, but rather the requested instant time of the write operation. During the file slicing process, Hudi looks up the completion time for each log file using its instant time and applies a new file slicing rule:
+
+> _A log file belongs to the file slice with the maximum base requested time that is less than or equal to the log file's completion time._ [5]

Review Comment:
   Missing space between the period and the citation. Should be "time. [5]" instead of "time._[5]"
