Hi Bin,

Thanks for your proposal! This is also what I want!

Is your main scenario an Append table, or do we also need primary key tables? If we don't currently need the primary key table capabilities, we can consider supporting only Append tables for now. Support for the more complex primary key tables can be deferred to the next phase.

Best,
Jingsong

On Fri, Dec 19, 2025 at 11:39 AM bin zhou <[email protected]> wrote:
>
> Hi Paimon Devs,
>
> I'd like to start a discussion on enhancing Git-like operation capabilities.
>
> In the AI era, data lakes have become the system of record for large-scale
> machine learning and foundation model training.
> Beyond basic snapshot and branching capabilities, advanced Git-like
> operations, such as cherry-pick, merge, and rebase, are critical for
> enabling efficient and safe data evolution.
>
> These capabilities are essential for several reasons:
> 1. Selective Data Promotion (Cherry-pick)
> Machine learning workflows often require promoting only validated subsets
> of data changes (such as cleaned samples, corrected labels, or feature
> fixes) into training or production datasets. Cherry-pick enables precise,
> low-risk data promotion without pulling in unrelated or experimental
> changes.
> 2. Collaborative Data Development (Merge)
> Large-scale ML projects involve multiple teams working in parallel on data
> preparation, labeling, and feature generation. Merge semantics allow
> independent data branches to be combined in a controlled manner, enabling
> collaboration while preserving data consistency.
> 3. Continuous Data Refinement (Rebase)
> Training data and feature definitions evolve continuously. Rebase allows
> experimental data changes to be reapplied onto an updated baseline, keeping
> experiments aligned with the latest production data without restarting from
> scratch.
>
>
> What do you think about this proposal?
>
> https://docs.google.com/document/d/11Mq7KoRsZVP4Gf1JXrTy2-1tbIRRkHQMiYH15CUZvoY/edit?tab=t.0
