On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Replies inline.
>
> On Mon, Jun 17, 2019 at 10:59 AM Erik Wright
> <erik.wri...@shopify.com.invalid> wrote:
>
>>> Because snapshots and versions are basically the same idea, we don’t need
>>> both. If we were to add versions, they should replace snapshots.
>>>
>>
>> I'm a little confused by this. You mention sequence numbers quite a bit,
>> but then say that we should not introduce versions. From my understanding,
>> sequence numbers and versions are essentially identical. What has changed
>> is where they are encoded (in file metadata vs. in snapshot metadata). It
>> would also seem necessary to keep them distinct from snapshots in order to
>> be able to collapse/compact files older than N (oldest valid version)
>> without losing the ability to incrementally apply the changes from N + 1.
>>
> Snapshots are versions of the table, but Iceberg does not use a
> monotonically increasing identifier to track them. Your proposal added
> table versions in addition to snapshots to get a monotonically increasing
> ID, but my point is that we don’t need to duplicate the “table state at
> some time” idea by adding a layer of versions inside snapshots. We just
> need a monotonically increasing ID associated with each snapshot. That ID
> is what I’m calling a sequence number. Each snapshot will have a new sequence
> number used to track the order of files for applying changes that are
> introduced by that snapshot.
>
If there is a monotonically increasing value per snapshot, should we simply
require the snapshot ID itself to be monotonically increasing?
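
To make sure I follow, here is a rough sketch of how I picture the metadata
(all names are hypothetical, not a proposal for the actual schema):

    // Hypothetical shape of the metadata, just to confirm my understanding.
    // The table keeps a counter; each commit takes the next value, and each
    // data or delete file is tagged with the sequence number of the snapshot
    // that added it.
    class Snapshot {
      long snapshotId;      // unique, but not necessarily ordered today
      long sequenceNumber;  // monotonically increasing per commit
    }

    class FileEntry {
      String path;
      long sequenceNumber;  // sequence number of the snapshot that added it
    }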


> With sequence numbers embedded in metadata, you can still compact. If you
> have a file with sequence number N, then updated by sequence numbers N+2
> and N+3, you can compact the changes from N and N+2 by writing the
> compacted file with sequence N+2. Rewriting files in N+2 is still possible,
> just like with the scheme you suggested, where the version N+2 would be
> rewritten.
>
> A key difference is that because the approach using sequence numbers
> doesn’t require keeping state around in the root metadata file, tables are
> not required to rewrite or compact data files just to keep the table
> functioning. If the data won’t be read, then why bother changing it?
>
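If I follow, the rewrite rule could be summarized as: a compacted file takes
the sequence number of the newest delete folded into it, so older deletes no
longer apply while newer ones still do. A minimal sketch (helper name
invented):

    // Hypothetical helper: sequence number for a compacted file.
    // e.g. data at N, deletes applied from N and N+2 -> the compacted file
    // gets N+2, and a later delete at N+3 still applies to it.
    static long compactedSequenceNumber(long dataFileSeq, long[] appliedDeleteSeqs) {
      long seq = dataFileSeq;
      for (long deleteSeq : appliedDeleteSeqs) {
        seq = Math.max(seq, deleteSeq);
      }
      return seq;
    }
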
>> Because I am having a really hard time seeing how things would work
>> without these sequence numbers (versions), and given their prominence in
>> your reply, my remaining comments assume they are present, distinct from
>> snapshots, and monotonically increasing.
>>
> Yes, a sequence number would be added to the existing metadata. Each
> snapshot would produce a new sequence number.
>
>>> Another way of thinking about it: if versions are stored as data/delete
>>> file metadata and not in the table metadata file, then the complete history
>>> can be kept.
>>
>>
>> Sure. Although version expiry would still be an important process to
>> improve read performance and reduce storage overhead.
>>
> Maybe. If it is unlikely that the data will be read, the best option may
> be to leave the original data file and associated delete file.
>
It's safe to say that this is optional for any given use case, but that it
is a required capability.

>> Based on your proposal, it would seem that we can basically collapse any
>> given data file with its corresponding deletions. It's important to
>> consider how this will affect incremental readers. The following make sense
>> to me:
>>
>>    1. We should track an oldest-valid-version in the table metadata. No
>>    changes newer than this version should be collapsed, as a reader who has
>>    consumed up to that version must be able to continue reading.
>>
> Table versions are tracked by snapshots, which are kept for a window of
> time. The oldest valid version is the oldest snapshot in the table. New
> snapshots can always compact or rewrite older data, and incremental
> consumers ignore those changes by skipping snapshots whose operation was
> "replace", because the data has not changed.
>
> Incremental consumers operate on append, delete, and overwrite snapshots
> and consume data files that are added or deleted. With delete files, these
> would also consume delete files that were added in a snapshot.
>
Sure, when a whole file is deleted there is already a column to indicate
that, so we do not need to represent the deletion a second time.
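
For what it's worth, this matches how I picture an incremental consumer
working (a sketch only; I'm inventing the method names, not quoting the API):

    // Hypothetical incremental consumer loop.
    for (Snapshot snapshot : newSnapshotsSince(lastProcessedSnapshotId)) {
      if (snapshot.operation().equals("replace")) {
        continue;  // compactions/rewrites do not change the logical data
      }
      // append, delete, and overwrite snapshots carry real changes
      process(snapshot.addedDataFiles());
      process(snapshot.addedDeleteFiles());
    }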


> It is unlikely that the sequence number would be used by incremental
> consumers because the changes are already available for each snapshot. This
> distinction is the main reason why I’m calling these sequence numbers and
> not “versions”. The sequence number for a file indicates what delete files
> need to be applied; files can be rewritten, and the new copy gets a
> different sequence number, as you describe just below.
>
Interesting. This explains the difference in the way we are looking at
this. Yes, I agree that this ought to work and seems more consistent with
the way that Iceberg works currently.
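
Concretely, I'd expect the read-time rule to reduce to something like the
following (a sketch; I'm assuming, per your compaction example, that a delete
applies only to data files with a strictly older sequence number):

    // Hypothetical predicate: does a delete file apply to a data file?
    // A compacted file takes the sequence number of the newest delete folded
    // into it, so a delete with an equal sequence number must not re-apply;
    // only strictly newer deletes do.
    static boolean deleteApplies(long deleteFileSeq, long dataFileSeq) {
      return dataFileSeq < deleteFileSeq;
    }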

>
>>    2. Any data file strictly older than the oldest-valid-version is
>>    eligible for delete collapsing. During delete collapsing, all deletions
>>    up to version N (any version <= the oldest-valid-version) are applied to
>>    the file. *The newly created file should be assigned a sequence number
>>    of N.*
>>
> First, a data file is kept until the last snapshot that references it is
> expired. That enables incremental consumption without restricting rewrites
> of the file in a later replace snapshot.
>
> Second, I agree with your description of rewriting the file. That’s what I
> was trying to say above.
>
Given that incremental consumers will skip "replace" snapshots and will not
attempt to incrementally read the changes introduced in version "N" using
the manifest list file for version "N+1", there would appear to be no
restrictions on how much data can be collapsed in any rewrite operation.
The sequence number assigned to the resulting files is irrelevant if _all_
relevant deletes have been applied.

>
>>    3. Any delete file that has been completely applied is eligible for
>>    expiry. A delete file is completely applied if its sequence number is
>>    <= the sequence numbers of all data files to which it may apply (based
>>    on partitioning data).
>>
>> Yes, I agree.
>
>
>>    4. Any data file equal to or older than the oldest-valid-version is
>>    eligible for compaction. During compaction, some or all data files up
>>    to version N (any version <= the oldest-valid-version) are selected.
>>    There must not be any deletions applicable to any of the selected data
>>    files. The selected data files are read, rewritten as desired
>>    (partitioning, compaction, sorting, etc.). The new files are included
>>    in the new snapshot (*with sequence number N*) while the old files are
>>    dropped.
>>
>> Yes, I agree.
>
Ultimately, all files would appear to be eligible for compaction in any
rewrite snapshot.
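
In code, the pre-condition from point 4 might look like this (again a
sketch, reusing the hypothetical deleteApplies predicate from above):

    // Sketch: data files selected for a pure rewrite must have no pending
    // deletes; otherwise the rewrite must also apply those deletes and take
    // the newest applied delete's sequence number.
    static boolean canRewriteWithoutApplyingDeletes(
        long dataFileSeq, long[] candidateDeleteSeqs) {
      for (long deleteSeq : candidateDeleteSeqs) {
        if (deleteApplies(deleteSeq, dataFileSeq)) {
          return false;  // a newer delete still applies to this file
        }
      }
      return true;
    }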

>
>>    5. Delete collapsing, delete expiry, and compaction may be performed
>>    in a single commit. The oldest-valid-version may be advanced during this
>>    process as well. The outcome must logically be consistent with applying
>>    these steps independently.
>>
> Because snapshots (table versions) are independent, they can always be
> expired. Incremental consumers must process a new snapshot before it is old
> enough to expire. The time period after which snapshots are expired is up
> to the process running ExpireSnapshots.
>
Agreed.
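
For reference, I believe that window is already controlled through the
existing ExpireSnapshots API, along these lines (from memory, so treat it as
a sketch; table here is an org.apache.iceberg.Table):

    // Expire snapshots older than the retention window; incremental
    // consumers must keep up within this window.
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
    table.expireSnapshots()
        .expireOlderThan(cutoff)
        .commit();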

>> Important to note in the above is that, during collapsing/compaction, new
>> files may be emitted with a version number N that is older than the most
>> recent version. This is important to ensure that all deltas newer than N
>> are appropriately applied, and that incremental readers are able to
>> continue processing the dataset.
>>
> Because incremental consumers operate using snapshots and not sequence
> numbers, I think this is decoupled in the approach I’m proposing.
>
Agreed.

>>> It is also compatible with file-level deletes used to age off old data
>>> (e.g. delete where day(ts) < '2019-01-01').
>>
>>
>> This particular operation is not compatible with incremental consumption.
>>
> I disagree. This just encodes deletes for an entire file’s worth of rows in
> the manifest file instead of in a delete file. Current metadata tracks when
> files are deleted and we would also associate a sequence number with those
> changes, if we needed to. Incremental consumers will work like they do
> today: they will consume snapshots that contain changes
> (append/delete/overwrite) and will get the files that were changed in that
> snapshot.
>
Agreed.


> Also, being able to cleanly age off data is a hard requirement for
> Iceberg. We all need to comply with data policies with age limits and we
> need to ensure that we can cleanly apply those changes.
>
I don't think that the other approach inhibits aging off data, it just
represents the deletion of the data differently. In any case, I agree that
we can reuse the existing "deleted in snapshot" mechanism.
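
In other words, for the age-off example the operation might look roughly
like this (a sketch; the helper names are invented, not the real API):

    // Hypothetical age-off using whole-file deletes recorded in the
    // manifest: drop every data file whose partition falls entirely before
    // the cutoff, without rewriting file contents or emitting row-level
    // delete files.
    for (FileEntry file : manifestEntries()) {
      if (partitionEntirelyBefore(file, "2019-01-01")) {
        markDeletedInSnapshot(file);  // existing whole-file delete mechanism
      }
    }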

>> One must still generate a deletion that is associated with a new sequence
>> number. I could see a reasonable workaround where, in order to delete an
>> entire file (insertion file `x.dat`, sequence number X), one would reference
>> the same file a second time (deletion file `x.dat`, sequence number Y > X).
>> Since an insertion file and a deletion file have compatible formats, there
>> is no need to rewrite this file simply to mark each row in it as deleted. A
>> savvy consumer would be able to tell that this is a whole-file deletion
>> while a naïve consumer would be able to apply them using the basic
>> algorithm.
>>
> With file-level deletes, the files are no longer processed by readers,
> unless an incremental reader needs to read all of the deletes from the file.
>
>> If it seems like we are roughly on the same page, I will take a stab at
>> updating that document to go to the same level of detail that it does now
>> but use the sequence-number approach.
>>
> Sure, that sounds good.
>
Phew. I think we are nearly there!

> --
> Ryan Blue
> Software Engineer
> Netflix
>