Hi Mayur,

I think what you describe of writing in parallel and committing using a
coordinator is the strategy used by most of the engines today. The stream
of DataFile (statistics collected from written data files) are passed to
the coordinator to do a single commit. In Spark, it's passed as
WriteCommitMessage (see SparkWrite.commit). In Presto and Trino, it's
passed as CommitTaskData.

I am not sure why staging and cherry-pick is needed, am I missing anything?

Thanks,
Jack Ye

On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava <
mayur.srivast...@twosigma.com> wrote:

> Hi,
>
>
>
> Let’s say there are N (e.g. 32) distributed processes writing to different
> (non-overlapping) partitions in the same Iceberg table in parallel.
>
> When all of them finish writing, is there a way to do a single commit (by
> a coordinator process) at the end so that either all or none is committed?
>
>
>
> I see that there is a snapshot staging support, but can we cherry-pick
> multiple staged snapshots (as long as it is safe)?
>
>
>
> Thanks,
>
> Mayur
>
>
>

Reply via email to