SteNicholas commented on issue #1795: URL: https://github.com/apache/incubator-paimon/issues/1795#issuecomment-1678391369
@FangYongs, thanks for driving this significant feature. Here is some experience from our internal practice.

The branch feature has the following typical application scenarios:

- A snapshot named compacted-YYYYMMDD is generated on the base table every day, and users use the snapshot table to generate daily derived tables and calculate report data. When the user's downstream calculation logic changes, the corresponding snapshot can be selected for recalculation. You can also set a retention period of X days and clear out expired data every day. In fact, SCD-2 can also be implemented naturally on multi-snapshot data.
- An archive branch named yyyy-archived can be generated every year after the data is compressed and optimized. If our storage strategy changes (such as deleting sensitive information), we can generate a new snapshot on this branch after performing the related operations.
- A snapshot named preprod-xx can be officially released to users after the necessary quality checks, avoiding coupling between external tools and the pipeline itself.

In business production scenarios, meeting users' branch needs requires more attention to ease of use. For example, how does the user know that a snapshot has been published correctly? One of the problems involved is visibility: users should be able to explicitly get the snapshot table anywhere in the pipeline. In addition, a common requirement in the snapshot scenario is precise segmentation of data. For example, users do not want data whose event time falls on the 1st to drift into the snapshot of the 2nd; the preferred approach is to use the watermark recorded under each manifest to segment snapshots precisely.

To better meet the needs of the production environment, we have implemented the following optimizations:

- Snapshot branch and lifecycle management.
- On the MergeOnRead table, on top of streaming writes, we can provide data cut off exactly at midnight (00:00) for downstream offline processing.

The branch feature aims to solve two problems with real-time data ingestion into the lake: only incremental partitions are supported, and full partitions waste storage. Meanwhile, branch support brings the following benefits:

- Reduced storage costs: only one copy of the data is stored.
- Improved efficiency: the end-to-end pipeline is simple, and the cost of engineering and productization is low.
- Risk-free access through time travel and branch management.
- Accurate partition segmentation that requires no additional filter conditions from users.
- A single table provides incremental, full, and real-time partitions. The user's SQL remains unchanged; only the hint/option changes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
