SteNicholas commented on issue #1795:
URL: 
https://github.com/apache/incubator-paimon/issues/1795#issuecomment-1678391369

   @FangYongs, thanks for driving this significant feature. Here is some 
experience from our internal practice.
   
   The branch feature has the following typical application scenarios:
   
   - A snapshot named compacted-YYYYMMDD is generated on the base table every 
day, and users use the snapshot table to generate daily derived tables and 
compute report data. When the user's downstream calculation logic changes, the 
corresponding snapshot can be selected for recalculation. A retention period of 
X days can also be set, with expired data cleared out every day. In fact, SCD-2 
can also be implemented naturally on multi-snapshot data.
   
   - An archive branch named yyyy-archived can be generated every year after 
the data is compacted and optimized. If our storage strategy changes (for 
example, sensitive information must be deleted), we can perform the related 
operations and then generate a new snapshot on this branch.
   
   - A snapshot named preprod-xx can be officially released to users after the 
necessary quality checks, which avoids coupling external tools to the pipeline 
itself.
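
The daily naming and retention rule from the first scenario can be sketched as follows. This is only an illustration under stated assumptions: the helper names `tag_name` and `expired_tags` are hypothetical and are not Paimon APIs, and the retention check simply compares the date embedded in the name with a cutoff.

```python
from datetime import date, datetime, timedelta
from typing import List

TAG_PREFIX = "compacted-"  # prefix taken from the scenario above


def tag_name(day: date) -> str:
    """Build the daily snapshot name, e.g. compacted-20230815."""
    return f"{TAG_PREFIX}{day:%Y%m%d}"


def expired_tags(existing: List[str], today: date, retention_days: int) -> List[str]:
    """Return the daily snapshot names that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        if not name.startswith(TAG_PREFIX):
            continue  # not a daily snapshot (e.g. preprod-xx); leave it alone
        try:
            day = datetime.strptime(name[len(TAG_PREFIX):], "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date; skip rather than guess
        if day < cutoff:
            expired.append(name)
    return expired
```

With a retention of 3 days and today being 2023-08-15, only names dated before 2023-08-12 would be returned for cleanup; names that do not follow the daily pattern are never touched.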
   
   In production business scenarios, meeting users' branch needs requires more 
attention to ease of use. For example, how does a user know that a snapshot has 
been published correctly? One of the problems involved is visibility: users 
should be able to explicitly obtain the snapshot table anywhere in the 
pipeline. In addition, a common requirement in the snapshot scenario is precise 
segmentation of data. For example, users do not want data whose event time 
falls on the 1st to drift into the snapshot of the 2nd; the preferred approach 
is to combine the watermark under each manifest to perform fine-grained 
snapshot segmentation.
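
A minimal sketch of the watermark idea, assuming each commit records an event-time watermark: the snapshot for a given day is cut at the first commit whose watermark has reached the following midnight, so data from the 1st cannot drift into the snapshot of the 2nd. The `Commit` class and `snapshot_for_day` function are hypothetical names for illustration only. The sketch works at commit granularity; a per-manifest variant would apply the same comparison to each manifest's watermark.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta
from typing import List, Optional


@dataclass
class Commit:
    snapshot_id: int
    watermark: datetime  # event-time watermark recorded with the commit (assumed field)


def snapshot_for_day(commits: List[Commit], day: date) -> Optional[int]:
    """Return the id of the first commit whose watermark has reached the end of `day`."""
    boundary = datetime.combine(day + timedelta(days=1), time.min)
    for commit in commits:  # commits are assumed to be ordered by snapshot_id
        if commit.watermark >= boundary:
            return commit.snapshot_id
    return None  # the watermark has not passed midnight yet; the day is incomplete
```

Returning `None` while the watermark is still behind the boundary is a deliberate choice here: publishing is delayed until the day is known to be complete, rather than cutting on processing time.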
   
   To better meet the needs of the production environment, we have implemented 
the following optimizations:
   
   - Provide snapshot branch and lifecycle management features.
   
   - On MergeOnRead tables, provide data cut off exactly at 00:00 (midnight) 
for downstream offline processing, on top of streaming writes.
   
   The branch feature aims to solve two problems of real-time data ingestion 
into the lake: only incremental partitions are supported, and full partitions 
waste storage. Meanwhile, branch support brings the following benefits:
   
   - Reduce storage costs: only one copy of the data is stored.
   
   - Improve efficiency: the end-to-end pipeline is simple, and the cost of 
engineering and productization is low.
   
   - Provide risk-free access through time travel and branch management.
   
   - Partition segmentation is accurate, so users do not need additional filter 
conditions.
   
   - A single table provides incremental, full, and real-time partitions. The 
user's SQL remains unchanged; only the hint/option is changed.

