Hi Mayur, I think what you describe of writing in parallel and committing using a coordinator is the strategy used by most of the engines today. The stream of DataFile (statistics collected from written data files) are passed to the coordinator to do a single commit. In Spark, it's passed as WriteCommitMessage (see SparkWrite.commit). In Presto and Trino, it's passed as CommitTaskData.
I am not sure why staging and cherry-pick is needed, am I missing anything? Thanks, Jack Ye On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava < mayur.srivast...@twosigma.com> wrote: > Hi, > > > > Let’s say there are N (e.g. 32) distributed processes writing to different > (non-overlapping) partitions in the same Iceberg table in parallel. > > When all of them finish writing, is there a way to do a single commit (by > a coordinator process) at the end so that either all or none is committed? > > > > I see that there is a snapshot staging support, but can we cherry-pick > multiple staged snapshots (as long as it is safe)? > > > > Thanks, > > Mayur > > >