hililiwei commented on PR #7638:
URL: https://github.com/apache/iceberg/pull/7638#issuecomment-1553922039

   @stevenzwu Thank you for taking the time to review. The purpose of this PR 
is to implement a partition commit policy based on Iceberg tables, which has 
the following differences and advantages compared to proposal #6253:
   
   • Proposal #6253 only writes the watermark of the Flink task to the table’s 
Summary, rather than actually committing partitions. This makes it impossible 
for downstream application to directly determine which partitions are visible, 
and they need to calculate it themselves based on the watermark and the each 
partition time. Moreover, the value of the watermark may decrease due to task 
restart or data replay, making some partitions that were already visible 
invisible again, which is unacceptable in a production environment. This PR 
commits partitions based on the watermark and the event time of the partition, 
so that downstream application can directly see which partitions are available 
without extra calculation and judgment. This can improve the query efficiency 
and experience of downstream application. In scenarios such as BI and ad hoc, 
it can also avoid the problem of developers forgetting to filter data based on 
the watermark, and not every developer knows that they need t
 o use the watermark to filter data, this greatly increases the possibility of 
them getting illegal data.
   
   • This PR allows users to customize the partition commit policy, which can 
perform some high-level custom operations when committing partitions. In some 
scenarios, downstream applications need to not only process newly committed 
partitions, but also deal with data from old partitions that arrive late, and 
perform some custom operations on the table, such as remote API calls, event 
notifications, data deletion, etc., or even file merging. This is very useful 
for table management and task flow customization. Users can choose the 
appropriate commit strategy according to their own business needs and 
scenarios, achieving more flexible and efficient data processing. Our internal 
business developers have used this feature to develop some custom commit 
policies, which have brought us a lot of convenience and advantages.
   
   • This PR maintains high compatibility with the Flink ecosystem. It uses the 
`_SUCCESS` file as a marker to indicate partition commit 
(https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/connectors/table/filesystem/#partition-commit),
 which makes it easy to migrate stream tables from Filesystem/Hive to Iceberg 
without changing the original logic and process. This can reduce migration 
costs and risks, and improve migration efficiency and stability.
   
   Thank you again for your review and feedback.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to