I apologize for the delay on my side. I still have to go through the last emails. I am available on Thursday/Friday this week, and it would be great to sync.
Thanks,
Anton

> On 3 Jul 2019, at 01:29, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Sorry I didn't get back to this thread last week. Let's try to have a video call to sync up on this next week. What days would work for everyone?
>
> rb
>
> On Fri, Jun 21, 2019 at 9:06 AM Erik Wright <erik.wri...@shopify.com> wrote:
>
> With regards to operation values, currently they are:
>
> append: data files were added and no files were removed.
> replace: data files were rewritten with the same data; i.e., compaction, changing the data file format, or relocating data files.
> overwrite: data files were deleted and added in a logical overwrite operation.
> delete: data files were removed and their contents logically deleted.
>
> If deletion files (with or without data files) are appended to the dataset, will we consider that an `append` operation? If so, if deletion and/or data files are appended, and whole files are also deleted, will we consider that an `overwrite`?
>
> Given that the only apparent purpose of the operation field is to optimize snapshot expiration, the above seems to meet its needs. An incremental reader can also skip `replace` snapshots, but no others. Once it decides to read a snapshot, I don't think there's any difference in how it processes the data for the append/overwrite/delete cases.
>
> On Thu, Jun 20, 2019 at 8:55 PM Ryan Blue <rb...@netflix.com> wrote:
>
> I don't see that we need [sequence numbers] for file/offset-deletes, since they apply to a specific file. They're not harmful, but they don't seem relevant.
>
> These delete files will probably contain a path and an offset and could contain deletes for multiple files. In that case, the sequence number can be used to eliminate delete files that don't need to be applied to a particular data file, just like the column equality deletes. Likewise, it can be used to drop the delete files when there are no data files with an older sequence number.
>
> I don't understand the purpose of the min sequence number, nor what the "min data seq" is.
>
> Min sequence number would be used for pruning delete files without reading all the manifests to find out if there are old data files. If no manifest with data for a partition contains a file older than some sequence number N, then any delete file with a sequence number < N can be removed (sketched below).
>
> OK, so the minimum sequence number is an attribute of manifest files. Sounds good. It can likely permit us to optimize compaction operations as well (i.e., you can easily limit the operation to a subset of manifest files as long as they are the oldest ones).
>
> The "min data seq" is the minimum sequence number of a data file. That seems like what we actually want for the pruning I described above.
>
> I would expect a data file (appended rows or deletions by column value) to have a single sequence number that applies to the whole file. Even a delete-by-file-and-offset file can make do with only a single sequence number (which must be larger than the sequence numbers of all deleted files). Why do we need a "minimum" data sequence per file?
>
> Off the top of my head, [supporting non-key delete] requires adding additional information to the manifest file, indicating the columns that are used for the deletion. Only equality would be supported; if multiple columns were used, they would be combined with boolean-and. I don't see anything too tricky about it.
>
> Yes, exactly. I actually phrased it wrong initially. I think it would be simple to extend the equality deletes to do this. We just need a way to have global scope, not just partition scope.
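To make the sequence-number rules concrete, here is a minimal sketch of the two checks described above. The class and parameter names are hypothetical, not the actual Iceberg API:

    // Illustrative helpers mirroring the rules discussed in the thread,
    // not the actual Iceberg implementation.
    class SequenceNumberPruning {

        // A delete file needs to be applied to a data file only when the
        // data file is older, i.e. committed with a smaller sequence number.
        static boolean deleteApplies(long deleteFileSeq, long dataFileSeq) {
            return dataFileSeq < deleteFileSeq;
        }

        // If minDataSeq is the smallest sequence number of any live data
        // file in a partition, a delete file at or below that number has no
        // data file left to apply to and can be dropped during maintenance.
        static boolean canDropDeleteFile(long deleteFileSeq, long minDataSeq) {
            return deleteFileSeq <= minDataSeq;
        }
    }

Keeping the minimum data sequence number as a manifest attribute is what would let a writer evaluate a check like canDropDeleteFile without opening every manifest, which is the pruning described above.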
> I don't think anything special needs to be done with regards to scoping/partitioning of delete files. When scanning one or more data files, one must also consider any and all deletion files that could apply to them. The only way to prune deletion files from consideration is:
>
> All of your data files have at least one partition column in common.
> The deletion file is also partitioned on that column (at least).
> The value sets of the data files do not overlap the value sets of the deletion files in that column.
>
> So given a dataset of sessions that is partitioned by device form factor and date, for example, you could have a delete (user_id=9876) in a deletion file that is not partitioned, and it would be "in scope" for all of those data files.
>
> If you had the same dataset partitioned by hash(user_id), and your deletes were _also_ partitioned by hash(user_id), you would be able to prune those deletes while scanning the sessions. (A sketch of this check appears at the end of this message.)
>
> If we add this on a per-deletion-file basis, it is not clear whether there is any relevance in preserving the concept of a unique row ID.
>
> Agreed. That's why I've been steering us away from the debate about whether keys are unique or not. Either way, a natural key delete must delete all of the records it matches.
>
> I would assume that the maximum sequence number should appear in the table metadata.
>
> Agreed.
>
> [W]ould you make it optional to assign a sequence number to a snapshot? "Replace" snapshots would not need one.
>
> The only requirement is that it is monotonically increasing. If one isn't used, we don't have to increment. I'd say it is up to the implementation to decide. I would probably increment it every time to avoid errors.
>
> --
> Ryan Blue
> Software Engineer
> Netflix
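To make the scoping rules concrete, here is a rough sketch of the pruning check outlined above. The types are hypothetical (each file's partition data is modeled as a map from partition column name to the set of values present in the file); this is not the actual Iceberg API:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;

    // Illustrative only: a delete file can be skipped for a data file when
    // they share a partition column whose value sets are provably disjoint.
    class DeleteFileScoping {

        // An unpartitioned delete file (empty map) stays in scope for every
        // data file, like the user_id=9876 example above.
        static boolean inScope(Map<String, Set<String>> dataValues,
                               Map<String, Set<String>> deleteValues) {
            for (Map.Entry<String, Set<String>> e : deleteValues.entrySet()) {
                Set<String> dataSet = dataValues.get(e.getKey());
                if (dataSet != null && Collections.disjoint(dataSet, e.getValue())) {
                    return false; // provably cannot match any row in the data file
                }
            }
            return true;
        }
    }

Under the hash(user_id) layout above, a delete file would stay in scope only for data files in the same bucket, which is exactly the pruning described; the unpartitioned delete shares no partition column with the sessions data, so nothing can be pruned for it.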