With regard to operation values, currently they are:

- append: data files were added and no files were removed.
- replace: data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
- overwrite: data files were deleted and added in a logical overwrite operation.
- delete: data files were removed and their contents logically deleted.
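For illustration only, here is how a reader or maintenance job might branch on these operation values. This is a minimal sketch; the helper names are hypothetical, not the Iceberg API, and the skip/cleanup rules encode the interpretation discussed in this thread (only `replace` carries no logical data change; only `replace`/`overwrite`/`delete` can leave unreferenced files behind):

```python
# Illustrative sketch only: mapping the four operation values to the two
# consumers discussed in this thread. Names are hypothetical, not Iceberg API.
VALID_OPERATIONS = {"append", "replace", "overwrite", "delete"}


def incremental_reader_can_skip(operation: str) -> bool:
    """A replace snapshot rewrites the same data, so an incremental
    reader can skip it; every other operation changes the logical data."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return operation == "replace"


def may_have_removed_files(operation: str) -> bool:
    """For snapshot expiration: only these operations can leave behind
    data files that are no longer referenced and may need cleanup."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return operation in {"replace", "overwrite", "delete"}
```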
If deletion files (with or without data files) are appended to the dataset, will we consider that an `append` operation? If so, if deletion and/or data files are appended, and whole files are also deleted, will we consider that an `overwrite`?

Given that the only apparent purpose of the operation field is to optimize snapshot expiration, the above seems to meet its needs. An incremental reader can also skip `replace` snapshots, but no others. Once it decides to read a snapshot, I don't think there's any difference in how it processes the data for the append/overwrite/delete cases.

On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:

> I don’t see that we need [sequence numbers] for file/offset-deletes, since they apply to a specific file. They’re not harmful, but they don’t seem relevant.
>
> These delete files will probably contain a path and an offset and could contain deletes for multiple files. In that case, the sequence number can be used to eliminate delete files that don’t need to be applied to a particular data file, just like the column equality deletes. Likewise, it can be used to drop the delete files when there are no data files with an older sequence number.
>
> I don’t understand the purpose of the min sequence number, nor what the “min data seq” is.
>
> Min sequence number would be used for pruning delete files without reading all the manifests to find out if there are old data files. If no manifest with data for a partition contains a file older than some sequence number N, then any delete file with a sequence number < N can be removed.

OK, so the minimum sequence number is an attribute of manifest files. Sounds good. It can likely let us optimize compaction operations as well (i.e., you can easily limit the operation to a subset of manifest files as long as they are the oldest ones).

> The “min data seq” is the minimum sequence number of a data file.
> That seems like what we actually want for the pruning I described above.

I would expect a data file (appended rows or deletions by column value) to have a single sequence number that applies to the whole file. Even a delete-by-file-and-offset file can do with only a single sequence number (which must be larger than the sequence numbers of all deleted files). Why do we need a "minimum" data sequence per file?

> Off the top of my head [supporting non-key delete] requires adding additional information to the manifest file, indicating the columns that are used for the deletion. Only equality would be supported; if multiple columns were used, they would be combined with boolean-and. I don’t see anything too tricky about it.
>
> Yes, exactly. I actually phrased it wrong initially. I think it would be simple to extend the equality deletes to do this. We just need a way to have global scope, not just partition scope.

I don't think anything special needs to be done with regard to scoping/partitioning of delete files. When scanning one or more data files, one must also consider any and all deletion files that could apply to them. The only way to prune deletion files from consideration is:

1. All of your data files have at least one partition column in common.
2. The deletion file is also partitioned on that column (at least).
3. The value sets of the data files do not overlap the value sets of the deletion files in that column.

So given, for example, a dataset of sessions that is partitioned by device form factor and date, you could have a delete (user_id=9876) in a deletion file that is not partitioned, and it would be "in scope" for all of those data files. If you had the same dataset partitioned by hash(user_id), and your deletes were _also_ partitioned by hash(user_id), you would be able to prune those deletes while scanning the sessions.
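The three pruning conditions above can be sketched as a single predicate. This is a hedged sketch, not Iceberg's actual manifest metadata: the dict-of-value-sets representation and the function name are assumptions for illustration.

```python
# Hedged sketch of the pruning rule: a deletion file can be ignored for a data
# file only if they share a partition column whose value sets do not overlap.
# Partitions are modeled as {column_name: set_of_values}; an unpartitioned
# deletion file is an empty dict. This representation is an assumption.
def can_prune_delete(data_partition: dict, delete_partition: dict) -> bool:
    """Return True if the deletion file provably cannot apply to the data file."""
    shared = set(data_partition) & set(delete_partition)
    if not shared:
        # e.g. an unpartitioned delete (user_id=9876) stays "in scope" for
        # every data file, as in the sessions example above.
        return False
    # Prunable only if some shared column's value sets are disjoint.
    return any(
        data_partition[col].isdisjoint(delete_partition[col]) for col in shared
    )
```

For the sessions example: a deletion file partitioned by hash(user_id) whose bucket values do not appear in the scanned data files would be pruned, while an unpartitioned deletion file never is.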
> If we add this on a per-deletion file basis it is not clear if there is any relevance in preserving the concept of a unique row ID.
>
> Agreed. That’s why I’ve been steering us away from the debate about whether keys are unique or not. Either way, a natural key delete must delete all of the records it matches.
>
> I would assume that the maximum sequence number should appear in the table metadata
>
> Agreed.
>
> [W]ould you make it optional to assign a sequence number to a snapshot? “Replace” snapshots would not need one.
>
> The only requirement is that it is monotonically increasing. If one isn’t used, we don’t have to increment. I’d say it is up to the implementation to decide. I would probably increment it every time to avoid errors.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
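As a closing illustration, the sequence-number pruning rule discussed up-thread ("if no manifest with data for a partition contains a file older than some sequence number N, then any delete file with a sequence number < N can be removed") could be sketched as follows. All names here are hypothetical, and the per-manifest minimum data sequence number is modeled as a plain field, which is an assumption about the metadata layout.

```python
# Illustrative sketch of dropping delete files that can no longer apply to any
# live data file. Manifest/DeleteFile are stand-in types, not Iceberg classes.
from dataclasses import dataclass
from typing import List


@dataclass
class Manifest:
    min_data_seq: int  # smallest sequence number of any data file it tracks


@dataclass
class DeleteFile:
    seq: int  # sequence number the deletes were committed at


def prune_delete_files(
    manifests: List[Manifest], deletes: List[DeleteFile]
) -> List[DeleteFile]:
    # N = the oldest data sequence number still live anywhere in the partition.
    n = min(m.min_data_seq for m in manifests)
    # A delete file with seq < N cannot match any remaining data file, so it
    # can be removed; everything else must be kept.
    return [d for d in deletes if d.seq >= n]
```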