On Wed, Jun 19, 2019 at 6:39 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> Replies inline.
>
> On Mon, Jun 17, 2019 at 10:59 AM Erik Wright <erik.wri...@shopify.com.invalid> wrote:
>
>>> Because snapshots and versions are basically the same idea, we don't need both. If we were to add versions, they should replace snapshots.
>>
>> I'm a little confused by this. You mention sequence numbers quite a bit, but then say that we should not introduce versions. From my understanding, sequence numbers and versions are essentially identical. What has changed is where they are encoded (in file metadata vs. in snapshot metadata). It would also seem necessary to keep them distinct from snapshots in order to be able to collapse/compact files older than N (the oldest valid version) without losing the ability to incrementally apply the changes from N + 1.
>
> Snapshots are versions of the table, but Iceberg does not use a monotonically increasing identifier to track them. Your proposal added table versions in addition to snapshots to get a monotonically increasing ID, but my point is that we don't need to duplicate the "table state at some time" idea by adding a layer of versions inside snapshots. We just need a monotonically increasing ID associated with each snapshot. That ID is what I'm calling a sequence number. Each snapshot will have a new sequence number used to track the order of files for applying changes that are introduced by that snapshot.

If there is a monotonically increasing value per snapshot, should we simply update the requirement for the snapshot ID to require it to be monotonically increasing?

> With sequence numbers embedded in metadata, you can still compact. If you have a file with sequence number N, then updated by sequence numbers N+2 and N+3, you can compact the changes from N and N+2 by writing the compacted file with sequence N+2. Rewriting files in N+2 is still possible, just like with the scheme you suggested, where the version N+2 would be rewritten.
>
> A key difference is that because the approach using sequence numbers doesn't require keeping state around in the root metadata file, tables are not required to rewrite or compact data files just to keep the table functioning. If the data won't be read, then why bother changing it?
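To make the mechanics above concrete — a data file written at sequence N, delete files committed at N+2 and N+3, and a compacted copy that keeps sequence N+2 so the N+3 deletes still apply — here is a minimal illustrative sketch. It is not the Iceberg API: the file shapes and the applicability rule used (a delete file affects only data files with a strictly lower sequence number) are assumptions made for the example.

```java
// Illustrative only; hypothetical classes, not Iceberg's metadata model.
import java.util.List;
import java.util.stream.Collectors;

class SequenceNumberSketch {
  record DataFile(String path, long sequenceNumber) {}
  record DeleteFile(String path, long sequenceNumber) {}

  /** Delete files a reader must apply to the given data file (assumed rule: delete seq > data seq). */
  static List<DeleteFile> applicableDeletes(DataFile data, List<DeleteFile> deletes) {
    return deletes.stream()
        .filter(d -> d.sequenceNumber() > data.sequenceNumber())
        .collect(Collectors.toList());
  }

  /**
   * Compacting a data file with some of its deletes: the rewritten file takes the
   * sequence number of the newest delete that was folded in, so deletes committed
   * later (e.g. N+3) still apply to the new copy.
   */
  static DataFile compact(DataFile data, List<DeleteFile> appliedDeletes, String newPath) {
    long newSequence = appliedDeletes.stream()
        .mapToLong(DeleteFile::sequenceNumber)
        .max()
        .orElse(data.sequenceNumber());
    return new DataFile(newPath, newSequence);
  }

  public static void main(String[] args) {
    DataFile file = new DataFile("data-N.parquet", 1);        // written at sequence N = 1
    DeleteFile d2 = new DeleteFile("deletes-N+2.parquet", 3);  // N+2
    DeleteFile d3 = new DeleteFile("deletes-N+3.parquet", 4);  // N+3

    // Reader view before compaction: both delete files apply.
    System.out.println(applicableDeletes(file, List.of(d2, d3)));

    // Compact the changes from N and N+2; the new file carries sequence N+2.
    DataFile compacted = compact(file, List.of(d2), "data-N+2.parquet");
    // The N+3 deletes still apply to the compacted copy; the N+2 deletes do not.
    System.out.println(applicableDeletes(compacted, List.of(d2, d3)));
  }
}
```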
>> Because I am having a really hard time seeing how things would work without these sequence numbers (versions), and given their prominence in your reply, my remaining comments assume they are present, distinct from snapshots, and monotonically increasing.
>
> Yes, a sequence number would be added to the existing metadata. Each snapshot would produce a new sequence number.
>
>>> Another way of thinking about it: if versions are stored as data/delete file metadata and not in the table metadata file, then the complete history can be kept.
>>
>> Sure. Although version expiry would still be an important process to improve read performance and reduce storage overhead.
>
> Maybe. If it is unlikely that the data will be read, the best option may be to leave the original data file and associated delete file.

It's safe to say that expiry is optional for any given use case, but it is a required capability.

>> Based on your proposal, it would seem that we can basically collapse any given data file with its corresponding deletions. It's important to consider how this will affect incremental readers. The following points make sense to me:
>>
>> 1. We should track an oldest-valid-version in the table metadata. No changes newer than this version should be collapsed, as a reader who has consumed up to that version must be able to continue reading.
>
> Table versions are tracked by snapshots, which are kept for a window of time. The oldest valid version is the oldest snapshot in the table. New snapshots can always compact or rewrite older data, and incremental consumers ignore those changes by skipping snapshots where the operation was replace, because the data has not changed.
>
> Incremental consumers operate on append, delete, and overwrite snapshots and consume data files that are added or deleted. With delete files, these would also consume delete files that were added in a snapshot.

Sure, when a whole file is deleted there is already a column for indicating that, so we do not need to represent it a second time.

> It is unlikely that sequence numbers would be used by incremental consumers because the changes are already available for each snapshot. This distinction is the main reason why I'm calling these sequence numbers and not "versions". The sequence number for a file indicates what delete files need to be applied; files can be rewritten, and the new copy gets a different sequence number, as you describe just below.

Interesting. This explains the difference in the way we are looking at this. Yes, I agree that this ought to work and seems more consistent with the way that Iceberg works currently.

>> 2. Any data file strictly older than the oldest-valid-version is eligible for delete collapsing. During delete collapsing, all deletions up to version N (any version <= the oldest-valid-version) are applied to the file. *The newly created file should be assigned a sequence number of N.*
>
> First, a data file is kept until the last snapshot that references it is expired. That enables incremental consumption without a restriction on rewriting the file in a later replace snapshot.
>
> Second, I agree with your description of rewriting the file. That's what I was trying to say above.

Given that incremental consumers will skip "replace" snapshots and will not attempt to incrementally read the changes introduced in version "N" using the manifest list file for version "N+1", there would appear to be no restrictions on how much data can be collapsed in any rewrite operation. The sequence number assigned to the resulting files is irrelevant if _all_ relevant deletes have been applied.

>> 3. Any delete file that has been completely applied is eligible for expiry. A delete file is completely applied if its sequence number is <= the sequence numbers of all data files to which it may apply (based on partitioning data).
>
> Yes, I agree.
>
>> 4. Any data file equal to or older than the oldest-valid-version is eligible for compaction. During compaction, some or all data files up to version N (any version <= the oldest-valid-version) are selected. There must not be any deletions applicable to any of the selected data files. The selected data files are read and rewritten as desired (partitioning, compaction, sorting, etc.). The new files are included in the new snapshot (*with sequence number N*) while the old files are dropped.
>
> Yes, I agree.

Ultimately, all files would appear to be eligible for compaction in any rewrite snapshot.

>> 5. Delete collapsing, delete expiry, and compaction may be performed in a single commit. The oldest-valid-version may be advanced during this process as well. The outcome must logically be consistent with applying these steps independently.
>
> Because snapshots (table versions) are independent, they can always be expired. Incremental consumers must process a new snapshot before it is old enough to expire. The time period after which snapshots are expired is up to the process running ExpireSnapshots.

Agreed.
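Point 3 above (delete file expiry) follows directly from the applicability rule: once every live data file a delete file could match carries a sequence number at least as large as the delete's, the delete no longer masks anything and can be dropped. A small sketch of that check, under the same assumptions as the previous example (hypothetical classes, not the Iceberg API):

```java
// Illustrative only; assumes a delete file affects only data files with a lower
// sequence number, scoped by partition.
import java.util.List;

class DeleteExpirySketch {
  record DataFile(String path, String partition, long sequenceNumber) {}
  record DeleteFile(String path, String partition, long sequenceNumber) {}

  /**
   * A delete file is "completely applied" (safe to expire) when its sequence number
   * is <= the sequence number of every live data file in the partitions it could match,
   * i.e. there is no remaining file it still needs to mask.
   */
  static boolean isCompletelyApplied(DeleteFile delete, List<DataFile> liveDataFiles) {
    return liveDataFiles.stream()
        .filter(f -> f.partition().equals(delete.partition()))
        .allMatch(f -> delete.sequenceNumber() <= f.sequenceNumber());
  }

  public static void main(String[] args) {
    DeleteFile deletes = new DeleteFile("deletes-2.parquet", "day=2019-06-01", 2);

    // Before collapsing: a data file written at sequence 1 still needs the deletes.
    List<DataFile> before = List.of(new DataFile("data-1.parquet", "day=2019-06-01", 1));
    System.out.println(isCompletelyApplied(deletes, before)); // false

    // After the data file is rewritten with the deletes folded in (sequence 2),
    // nothing older remains and the delete file is eligible for expiry.
    List<DataFile> after = List.of(new DataFile("data-2.parquet", "day=2019-06-01", 2));
    System.out.println(isCompletelyApplied(deletes, after)); // true
  }
}
```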
>> Important to note in the above is that, during collapsing/compaction, new files may be emitted with a version number N that is older than the most recent version. This is important to ensure that all deltas newer than N are appropriately applied, and that incremental readers are able to continue processing the dataset.
>
> Because incremental consumers operate using snapshots and not sequence numbers, I think this is decoupled in the approach I'm proposing.

Agreed.

>>> It is also compatible with file-level deletes used to age off old data (e.g. delete where day(ts) < '2019-01-01').
>>
>> This particular operation is not compatible with incremental consumption.
>
> I disagree. This just encodes deletes for an entire file's worth of rows in the manifest file instead of in a delete file. Current metadata tracks when files are deleted, and we would also associate a sequence number with those changes if we needed to. Incremental consumers will work like they do today: they will consume snapshots that contain changes (append/delete/overwrite) and will get the files that were changed in that snapshot.

Agreed.

> Also, being able to cleanly age off data is a hard requirement for Iceberg. We all need to comply with data policies with age limits and we need to ensure that we can cleanly apply those changes.

I don't think that the other approach inhibits aging off data; it just represents the deletion of the data differently. In any case, I agree that we can reuse the existing "deleted in snapshot" mechanism.

>> One must still generate a deletion that is associated with a new sequence number. I could see a reasonable workaround where, in order to delete an entire file (insertion file `x.dat`, sequence number X), one would reference the same file a second time (deletion file `x.dat`, sequence number Y > X). Since an insertion file and a deletion file have compatible formats, there is no need to rewrite this file simply to mark each row in it as deleted. A savvy consumer would be able to tell that this is a whole-file deletion, while a naïve consumer would be able to apply it using the basic algorithm.
>
> With file-level deletes, the files are no longer processed by readers, unless an incremental reader needs to read all of the deletes from the file.
>
>> If it seems like we are roughly on the same page I will take a stab at updating that document to go to the same level of detail that it does now but use the sequence-number approach.
>
> Sure, that sounds good.

Phew. I think we are nearly there!

> --
> Ryan Blue
> Software Engineer
> Netflix
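For reference, a rough sketch of the snapshot-based incremental consumption both sides settle on above — act on append/delete/overwrite snapshots, skip replace snapshots produced by compaction or rewrites — using an assumed, simplified snapshot shape (operation name plus added/removed file lists) rather than the actual Iceberg metadata model:

```java
// Illustrative only; the Snapshot record below is a stand-in, not Iceberg's API.
import java.util.List;
import java.util.Set;

class IncrementalConsumerSketch {
  record Snapshot(long sequenceNumber, String operation,
                  List<String> addedFiles, List<String> removedFiles) {}

  // Operations that carry logical data changes an incremental reader must process.
  static final Set<String> DATA_CHANGES = Set.of("append", "delete", "overwrite");

  static void consume(List<Snapshot> snapshotsSinceLastRead) {
    for (Snapshot snapshot : snapshotsSinceLastRead) {
      if (!DATA_CHANGES.contains(snapshot.operation())) {
        continue; // e.g. "replace": compaction or rewrite, no logical change to emit
      }
      // Emit the changes introduced by this snapshot: new data or delete files,
      // and files removed by whole-file (manifest-level) deletes.
      System.out.printf("seq %d (%s): added=%s removed=%s%n",
          snapshot.sequenceNumber(), snapshot.operation(),
          snapshot.addedFiles(), snapshot.removedFiles());
    }
  }

  public static void main(String[] args) {
    consume(List.of(
        new Snapshot(5, "append", List.of("data-5.parquet"), List.of()),
        new Snapshot(6, "replace", List.of("compacted-6.parquet"), List.of("data-5.parquet")),
        new Snapshot(7, "delete", List.of(), List.of("old-day.parquet"))));
  }
}
```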