Hi Paimon Devs, I'd like to start a discussion on enhancing Git-like operation capabilities.

In the AI era, data lakes have become the system of record for large-scale machine learning and foundation model training. Beyond basic snapshot and branching capabilities, advanced Git-like operations (such as cherry-pick, merge, and rebase) are critical for enabling efficient and safe data evolution. These capabilities are essential for several reasons:

1. Selective Data Promotion (Cherry-pick)
Machine learning workflows often require promoting only validated subsets of data changes, such as cleaned samples, corrected labels, or feature fixes, into training or production datasets. Cherry-pick enables precise, low-risk data promotion without pulling in unrelated or experimental changes.

2. Collaborative Data Development (Merge)
Large-scale ML projects involve multiple teams working in parallel on data preparation, labeling, and feature generation. Merge semantics allow independent data branches to be combined in a controlled manner, enabling collaboration while preserving data consistency.

3. Continuous Data Refinement (Rebase)
Training data and feature definitions evolve continuously. Rebase allows experimental data changes to be reapplied onto an updated baseline, keeping experiments aligned with the latest production data without restarting from scratch.

What do you think about this proposal?
https://docs.google.com/document/d/11Mq7KoRsZVP4Gf1JXrTy2-1tbIRRkHQMiYH15CUZvoY/edit?tab=t.0
