hililiwei commented on PR #7638: URL: https://github.com/apache/iceberg/pull/7638#issuecomment-1553922039
@stevenzwu Thank you for taking the time to review. The purpose of this PR is to implement a partition commit policy for Iceberg tables. Compared to proposal #6253, it has the following differences and advantages:

• Proposal #6253 only writes the watermark of the Flink task to the table's summary; it does not actually commit partitions. Downstream applications therefore cannot directly determine which partitions are visible and have to work it out themselves from the watermark and each partition's time. Moreover, the watermark value may decrease after a task restart or data replay, which makes partitions that were already visible invisible again. That is unacceptable in a production environment. This PR commits partitions based on the watermark and the event time of the partition, so downstream applications can see directly which partitions are available without extra calculation or judgment. This improves query efficiency and the experience for downstream applications. In scenarios such as BI and ad hoc queries, it also avoids the problem of developers forgetting to filter data by watermark. Not every developer knows that they need to filter by watermark, which greatly increases the chance of them reading invalid data.

• This PR allows users to customize the partition commit policy, so that custom operations can run when a partition is committed. In some scenarios, downstream applications need to process not only newly committed partitions but also late-arriving data for old partitions, and to perform custom operations on the table such as remote API calls, event notifications, data deletion, or even file merging. This is very useful for table management and task flow customization. Users can choose a commit strategy that suits their own business needs and scenarios, which makes data processing more flexible and efficient.
Our internal business developers have already used this feature to build several custom commit policies, which have been very convenient for us.

• This PR maintains high compatibility with the Flink ecosystem. It uses the `_SUCCESS` file as a marker to indicate partition commit (https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/table/filesystem/#partition-commit), which makes it easy to migrate streaming tables from Filesystem/Hive to Iceberg without changing the original logic and process. This reduces migration cost and risk, and improves migration efficiency and stability.

Thank you again for your review and feedback.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
