Hi Jacques,

It’s not clear to me why unique keys would be needed at all.
Also, truly independent writers (like one job writing while another job compacts) effectively mean a distributed transaction, and I believe that is clearly out of scope for Iceberg to solve?

> On 21 May 2019, at 21:31, Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> It would be useful to describe the types of concurrent operations that
>> would be supported (i.e., failed snapshotting could easily be recovered,
>> vs. the whole operation needing to be re-executed) vs. those that
>> wouldn't. Solving for unlimited concurrency cases may create way more
>> complexity than is necessary.
>
> I'd like to restate my comment a little bit. We need unique keys to make
> things work. They can be synthetic or not, but they should not have any
> retrievable Iceberg-related data in them.
>
> The main thing I'm talking about is how you target a deletion across time.
> If you have a file A, and you want to delete record X in A, you define
> delete A.X. At the same time, another process may be compacting A into A'.
> In so doing, the position of A.X in A' is something other than X. At this
> point, the deletion needs to be rerun against A' so that we can ensure
> that the deletion is propagated forward. If the only thing you have is
> A.X, you need some way of getting to the same location in A'. You should
> be able to take the delta file that lists the delete of A.X and apply it
> directly to A' without having to also consult A. If you didn't need to
> solve this problem, you could simply use A.X as opposed to the key of A.X
> in your delta files.
>
>> Synthetic seems relative. If the synthetic key is client-supplied, in
>> what way is it relevant to Iceberg whether it is synthetic vs. natural?
>> By calling it synthetic within Iceberg there is a strong implication that
>> it is the implementation that generates it (the filename/position key
>> suggests that). If it's the client that supplies it, it _may_ be
>> synthetic (from the point of view of the overall data model; i.e., a
>> customer key in a database vs. a customer ID that shows up on a bill),
>> but from Iceberg's perspective that doesn't matter. Only the uniqueness
>> constraint does.
>
> I agree with the main statement here: the only real requirement is that
> keys need to be unique across all existing snapshots. There could be two
> generators: one that uses an Iceberg-internal behavior to generate keys
> and one that is user-definable. While there could be a third that uses an
> existing field (or set of fields) to define the key, I think we should
> probably avoid implementing this, as it has a whole other set of problems
> that are best left outside of Iceberg's area of concern.
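To make the remapping problem above concrete, here is a minimal sketch, using plain Python lists and hypothetical row keys rather than Iceberg's actual delete-file format, of why a positional reference A.X breaks once compaction rewrites A into A', while a stable key still resolves:

    # A minimal, hypothetical sketch (plain Python lists, not Iceberg's
    # actual delete-file format) of the remapping problem described above.

    # File A: each row carries a stable synthetic key plus a payload.
    file_a = [("k0", "alice"), ("k1", "bob"), ("k2", "carol")]

    # Positional delete: "file A, offset 1" -- the A.X form.
    positional_delete = ("A", 1)

    # Key-based delete: target the stable key instead.
    key_delete = "k1"

    # Compaction rewrites A into A'; suppose it drops the row at offset 0
    # (e.g., an earlier tombstone), so every surviving row shifts down.
    file_a_prime = file_a[1:]  # [("k1", "bob"), ("k2", "carol")]

    # Replaying the positional delete against A' removes the WRONG row:
    # offset 1 in A' is now "carol", not the row that was at offset 1 in A.
    after_positional = [r for i, r in enumerate(file_a_prime) if i != 1]
    print(after_positional)  # [('k1', 'bob')] -- "bob" wrongly survives

    # The key-based delete still lands on "bob", and the delta file applies
    # directly to A' without consulting A to translate offsets.
    after_key = [r for r in file_a_prime if r[0] != key_delete]
    print(after_key)  # [('k2', 'carol')]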
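And a similarly hedged sketch of the two generator styles described in the last quoted paragraph; the class names and interface are illustrative only, not an Iceberg API:

    import uuid

    class InternalKeyGenerator:
        """Iceberg-internal style: opaque keys, unique by construction and
        carrying no retrievable table metadata."""

        def next_key(self) -> str:
            return uuid.uuid4().hex


    class UserKeyGenerator:
        """User-definable style: the caller supplies keys; the only
        contract is that a key never repeats across existing snapshots."""

        def __init__(self, keys_in_existing_snapshots: set):
            self._seen = set(keys_in_existing_snapshots)

        def register(self, key: str) -> str:
            if key in self._seen:
                raise ValueError(f"key {key!r} already exists in a snapshot")
            self._seen.add(key)
            return key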