Hi Jacques,

It’s not clear to me why unique keys would be needed at all.
Also, truly independent writers (like one job writing while another job compacts) effectively mean a distributed transaction, and I believe that is clearly out of scope for Iceberg to solve?

> On 21 May 2019, at 21:31, Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> It would be useful to describe the types of concurrent operations that
>> would be supported (i.e., failed snapshotting could easily be recovered,
>> vs. the whole operation needing to be re-executed) vs. those that
>> wouldn't. Solving for unlimited concurrency cases may create way more
>> complexity than is necessary.
>
> I'd like to restate my comment a little bit. We need unique keys to make
> things work. They can be synthetic or not, but they should not have any
> retrievable Iceberg-related data in them.
>
> The main thing I'm talking about is how you target a deletion across time.
> If you have a file A, and you want to delete record X in A, you define
> delete A.X. At the same time, another process may be compacting A into A'.
> In so doing, the position of A.X in A' is something other than X. At this
> point, the deletion needs to be rerun against A' so that we can ensure
> that the deletion is propagated forward. If the only thing you have is
> A.X, you need some way of getting to the same location in A'. You should
> be able to take the delta file that lists the delete of A.X and apply it
> directly to A' without having to also consult A. If you didn't need to
> solve this problem, you could simply use A.X as opposed to the key of A.X
> in your delta files.
>
>> Synthetic seems relative. If the synthetic key is client-supplied, in
>> what way is it relevant to Iceberg whether it is synthetic vs. natural?
>> By calling it synthetic within Iceberg there is a strong implication that
>> it is the implementation that generates it (the filename/position key
>> suggests that). If it's the client that supplies it, it _may_ be
>> synthetic (from the point of view of the overall data model; i.e., a
>> customer key in a database vs. a customer ID that shows up on a bill),
>> but from Iceberg's perspective that doesn't matter. Only the uniqueness
>> constraint does.
>
> I agree with the main statement here: the only real requirement is that
> keys need to be unique across all existing snapshots. There could be two
> generators: one that uses an Iceberg-internal behavior to generate keys
> and one that is user-definable. While there could be a third that uses an
> existing field (or set of fields) to define the key, I think we should
> probably avoid implementing this, as it has a whole other set of problems
> that are best left outside of Iceberg's area of concern.
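To make the remapping problem above concrete, here is a minimal sketch, using plain Python lists and hypothetical row keys rather than Iceberg's actual delete-file format, of why a positional reference A.X breaks once compaction rewrites A into A', while a stable key still resolves:

    # A minimal, hypothetical sketch (plain Python lists, not Iceberg's
    # actual delete-file format) of the remapping problem described above.

    # File A: each row carries a stable synthetic key plus a payload.
    file_a = [("k0", "alice"), ("k1", "bob"), ("k2", "carol")]

    # Positional delete: "file A, offset 1" -- the A.X form.
    positional_delete = ("A", 1)

    # Key-based delete: target the stable key instead.
    key_delete = "k1"

    # Compaction rewrites A into A'; suppose it drops the row at offset 0
    # (e.g., an earlier tombstone), so every surviving row shifts down.
    file_a_prime = file_a[1:]  # [("k1", "bob"), ("k2", "carol")]

    # Replaying the positional delete against A' removes the WRONG row:
    # offset 1 in A' is now "carol", not the row that was at offset 1 in A.
    after_positional = [r for i, r in enumerate(file_a_prime) if i != 1]
    print(after_positional)  # [('k1', 'bob')] -- "bob" wrongly survives

    # The key-based delete still lands on "bob", and the delta file applies
    # directly to A' without consulting A to translate offsets.
    after_key = [r for r in file_a_prime if r[0] != key_delete]
    print(after_key)  # [('k2', 'carol')]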
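And a similarly hedged sketch of the two generator styles described in the last quoted paragraph; the class names and interface are illustrative only, not an Iceberg API:

    import uuid

    class InternalKeyGenerator:
        """Iceberg-internal style: opaque keys, unique by construction and
        carrying no retrievable table metadata."""

        def next_key(self) -> str:
            return uuid.uuid4().hex


    class UserKeyGenerator:
        """User-definable style: the caller supplies keys; the only
        contract is that a key never repeats across existing snapshots."""

        def __init__(self, keys_in_existing_snapshots: set):
            self._seen = set(keys_in_existing_snapshots)

        def register(self, key: str) -> str:
            if key in self._seen:
                raise ValueError(f"key {key!r} already exists in a snapshot")
            self._seen.add(key)
            return key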