Sorry, I didn't mean to reply to only Erik. Here's my response from
yesterday.

---------- Forwarded message ---------
From: Ryan Blue <rb...@netflix.com>
Date: Thu, May 16, 2019 at 1:13 PM
Subject: Re: Updates/Deletes/Upserts in Iceberg
To: Erik Wright <erik.wri...@shopify.com>


Replies inline.

On Thu, May 16, 2019 at 10:07 AM Erik Wright <erik.wri...@shopify.com>
wrote:

> I would be happy to participate. Iceberg with merge-on-read capabilities
> is a technology choice that my team is actively considering. It appears
> that our scenario differs meaningfully from the one that Anton and Miguel
> are considering. It would be great to take the time to compare the two and
> see if there is a single implementation that can meet the needs of each
> scenario.
>

Can you be more specific about where the use cases differ meaningfully? I
think that if we agree that operations on natural keys can be implemented
using synthetic keys to encode deletes (#2), then everyone is aligned on
the core parts of a design. We can figure out the implications of how
synthetic keys are encoded, but I don't see that issue (#3) having a huge
impact on use cases. So is #2 the main disagreement?


> On Wed, May 15, 2019 at 3:55 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> *2. Iceberg diff files should use synthetic keys*
>>
>> A lot of the discussion on the doc is about whether natural keys are
>> practical or what assumptions or trade-offs we can make about them. In my
>> opinion, Iceberg tables will absolutely need natural keys for reasonable
>> use cases. And those natural keys will need to be unique. And Iceberg will
>> need to rely on engines to enforce that uniqueness.
>>
>> But, there is a difference between table behavior and implementation. We
>> can use synthetic keys to implement the requirements of natural keys. Each
>> row should be identified by its file and position in a file. When deleting
>> by a natural key, we just need to find out what the synthetic key is and
>> encode that in the delete diff.
>>
> This comment has important implications for the effort required to
> generate delete diff files. I've tried to cover why in comments I added
> today to the doc, but it could also be a topic of the hangout.
>

Do you mean that you can't encode a delete without reading data to locate
the affected rows?

>> *3. Synthetic keys should be based on filename and position*
>>
>> I think identifying the file in a synthetic key makes a lot of sense.
>> This would allow for delta file reuse as individual files are rewritten by
>> a “major” compaction and would provide nice flexibility that fits with the
>> format. We will need to think through all the impacts, like how file
>> relocation works (e.g., move between regions) and the requirements for
>> rewrites (must apply the delta when rewriting).
>>
> I'm confused. I feel like specifying the filename has the opposite effect.
> One of the biggest advantages of Iceberg is the decoupling of a dataset
> from the physical location of its constituent files. If a delta file encodes
> the filename of the row that it updates/deletes, you are putting a
> significant constraint on the way that an implementation can manipulate
> those files later.
>

If I understand your concern, it is that we are encoding a file location in
the delete diff. We could solve this with a level of indirection like an ID
for data files in table metadata. So there are options to make sure we can
still move files and data around.
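
To make that concrete, here is a minimal sketch (in Java, with illustrative
names only, not Iceberg's actual classes) of a positional delete that
references a stable file ID instead of a physical path:

    import java.util.HashMap;
    import java.util.Map;

    class DeleteDiffSketch {
      // A positional delete: "the row at this position in this data file
      // is deleted". The file is named by a stable ID, not its location.
      record PositionDelete(long dataFileId, long rowPosition) {}

      public static void main(String[] args) {
        // Table metadata owns the ID -> location mapping.
        Map<Long, String> fileLocations = new HashMap<>();
        fileLocations.put(42L, "s3://us-east/data/part-00042.parquet");

        PositionDelete del = new PositionDelete(42L, 1337L);

        // Relocating the data file (e.g., moving it between regions) only
        // updates the mapping; the delete diff itself is untouched.
        fileLocations.put(42L, "s3://eu-west/data/part-00042.parquet");

        System.out.printf("delete row %d of %s%n",
            del.rowPosition(), fileLocations.get(del.dataFileId()));
      }
    }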

What I like about using a filename or file-specific identifier is that the
deltas are tied to a particular file. When that file is deleted, the delta
no longer needs to be carried forward. So if I have a maintenance job that
is compacting small files, it must apply deletes when rewriting all of
those files. But we don't have to rewrite or replace the delete diff file
because its deletes were scoped to a file that no longer exists.
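
In toy form, the compaction rule is just this (a sketch, not real Iceberg
code):

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    class CompactionSketch {
      // Rewriting a data file must first apply the position deletes scoped
      // to it; the delete diff is then obsolete because the file it
      // referenced no longer exists.
      static List<String> rewriteWithDeletes(List<String> rows,
                                             Set<Long> deletedPositions) {
        return IntStream.range(0, rows.size())
            .filter(pos -> !deletedPositions.contains((long) pos))
            .mapToObj(rows::get)
            .collect(Collectors.toList());
      }

      public static void main(String[] args) {
        List<String> oldFile = List.of("a", "b", "c", "d");
        Set<Long> deleteDiff = Set.of(1L, 3L);  // scoped to oldFile only
        System.out.println(rewriteWithDeletes(oldFile, deleteDiff));
        // prints [a, c]; deleteDiff can now be dropped with oldFile
      }
    }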

This is a good way around an ugly problem of knowing when to apply a
particular delete diff. Say we are using a UUID column for a natural key.
Then deletes just need to encode a set of UUIDs. But when a row is
upserted, it gets deleted and then re-appended with the same (natural) UUID
key. So we have to scope the delete to just the file that contained the
original row and not the inserted data file. It is tempting to use snapshot
ordering for this (delete diff applies to versions < V), but we eventually
lose the order of snapshots when they expire. We could also use insert diff
files, but then replacing the same row more than once has the same problem
(how does a delete apply to an insert diff?). Then to correctly encode
state, we would have to either solve the order problem or require a minor
compaction.
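
Here's a toy demonstration of the failure mode, and of how file scoping
avoids it (illustrative Java, not a proposed API):

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Stream;

    class UpsertScopeSketch {
      record Row(String uuid, String value) {}

      public static void main(String[] args) {
        List<Row> fileA = List.of(new Row("k1", "old"));  // original row
        Set<String> deleteByKey = Set.of("k1");           // delete diff
        List<Row> fileB = List.of(new Row("k1", "new"));  // re-appended row

        // Unscoped: the delete matches the re-inserted row too, so the
        // upserted value is lost.
        List<Row> wrong = Stream.concat(fileA.stream(), fileB.stream())
            .filter(r -> !deleteByKey.contains(r.uuid()))
            .toList();
        System.out.println(wrong);  // []

        // Scoped to fileA: only the original row is removed.
        List<Row> right = Stream.concat(
                fileA.stream().filter(r -> !deleteByKey.contains(r.uuid())),
                fileB.stream())
            .toList();
        System.out.println(right);  // [Row[uuid=k1, value=new]]
      }
    }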

Scoping the delete diffs to a file solves this problem by tying the
lifecycle of deltas to the lifecycle of files. If a file is live in the
table, then all deletes for that file must be applied when reading or
compacting it. When compacting, you can either write a delete diff for the
new file or remove the deleted records. That requirement seems pretty
reasonable to me.


-- 
Ryan Blue
Software Engineer
Netflix
