Hi Filip,

So your team has implemented the tombstone feature in an internal branch. As I understand it, the tombstone you describe is similar to the delete marker in HBase, so I think you're trying to implement an update/delete feature. Anton and Miguel have a design doc for that part; IMO your work should intersect with it.
Another question: one rule we shouldn't break (my personal view) is the open file format rule, meaning data is encoded in an open format and can be read with non-Iceberg tools. Will your implementation follow that rule? Would you encode the tombstones in your own format, or just write them into a separate file and mark it as a tombstone file?

I'm not familiar with the column predicate filter or row-level filter part. Would you mind providing more details? :-)

Thanks for the information :-)

https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/edit#

On Tue, Jan 14, 2020 at 10:20 PM Filip <filip....@gmail.com> wrote:

> Hi everyone,
>
> I was wondering if it would be of interest to start a thread on a proposal
> for a tombstone feature in Iceberg.
> Well, maybe it could be less strict than a general case implementation of
> tombstone [*1*].
> I would be looking for at least a couple of things to be covered on this
> thread:
>
> 1. [*Opportunity*] I was wondering if others on this list would find
> such a feature useful and/ or if folks would support such a feature be
> provided by Iceberg
> 2. [*Feasibility*] Also wondering if there's any intersection between
> a tombstone feature (i.e. filter by column predicate) and the upcoming
> Upsert spec/ implementation or if these two may very well serve different
> use-cases so it's wise they shouldn't be mixed up, only indirectly I guess,
> by the sheer implications of accidental complexity :)
>
> The current Iceberg codebase is quite generous wrt to the Open/Closed
> principle and we've been doing some spikes to implement such a feature in a
> new datasource and I've thought I'd share some touchpoints of our work so
> far (would gladly share this if community is interested):
>
> [*extension*] implementing tombstoning as a column/values predicate
> should be associated w/ some specific metadata (snapshot id, version?) and
> basic metrics (i.e. count, basic histograms) - mostly thinking that any
> tombstone operation feature is accompanied by a compaction task so metadata
> and metrics would help with building generic solutions for optimal
> scheduling of these maintenance tasks - tombstones could be modeled/
> programmed against the org.apache.iceberg.Table interface
>
> [*atomic guarantee*] a simple solution is to make the tombstone
> operation atomic by assigning a new snapshot summary property point to a
> file reference of an immutable file holding the tombstone
> predicates/expressions
>
> [*new API*] append tombstones
>
> [*new API*] remove tombstones
>
> [*new API*] append files and add tombstones
>
> [*new API*] append files and remove tombstones
>
> [*new API*] vacuum tombstones - a task as in clean up tombstoned rows
> and evict the associated tombstone metadata as well, oh and maybe not
> `vacuum` (I remember reading about this on this list in a different context
> and it's probably reserved for a different Iceberg feature, right?)
>
> [*extend*] extend the spark reader/writer to account for tombstone
> based filtering and tombstone committing respectively - writing may prove
> easier to implement than reading, reading comes in many flavours so
> applying a filter expression may not be as accessible at the moment as one
> would prefer to extend on top of Iceberg [*2*]
>
> [*extend*] removing snapshots would also account for their associated
> tombstone files to be dropped as well (where available)
>
> [*1*] I believe that in a general tombstone implementation the data
> filtering is applied only for the data that was added prior to the
> tombstone setting/ assignment operation but that might prove quite
> difficult to implement w/ Iceberg and it could be considered a
> specialization of a more generic and basic use-case of adding tombstone
> support as filtering by column values regardless of order of operations
> (data vs tombstone).
> [*2*] We could benefit from adding like an extension point/ hook into
> row-level filtering that we could leverage to translate tombstone options
> into row level filter/ predicates into
> https://github.com/apache/incubator-iceberg/blob/6048e5a794242cb83871e3838d0d40aa71e36a91/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L438-L439
>
> /Filip
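
To check my reading of the "column/values predicate" idea in the quoted proposal, here is a minimal, language-neutral sketch in Python. It is not Iceberg code and uses no Iceberg API: `Tombstone`, `filter_tombstoned`, and the `customer_id` column are hypothetical names made up for illustration. It assumes the simpler semantics from footnote [*1*], where a tombstone hides matching rows regardless of whether the data or the tombstone was written first.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Iterable, Iterator

Row = Dict[str, Any]


@dataclass(frozen=True)
class Tombstone:
    """Hypothetical tombstone: hides rows whose `column` holds one of `values`."""
    column: str
    values: FrozenSet[Any]

    def matches(self, row: Row) -> bool:
        # A row without the column is never tombstoned (unless None is listed).
        return row.get(self.column) in self.values


def filter_tombstoned(rows: Iterable[Row],
                      tombstones: Iterable[Tombstone]) -> Iterator[Row]:
    """Yield only the rows that no tombstone matches (read-side filtering)."""
    active = list(tombstones)
    for row in rows:
        if not any(t.matches(row) for t in active):
            yield row


# Rows as a reader might see them; the last row stands in for data
# appended after the tombstone was set.
rows = [
    {"customer_id": "a", "amount": 10},
    {"customer_id": "b", "amount": 20},
    {"customer_id": "c", "amount": 30},
    {"customer_id": "b", "amount": 40},
]
tombstones = [Tombstone("customer_id", frozenset({"b"}))]

visible = list(filter_tombstoned(rows, tombstones))
```

With these semantics both `customer_id == "b"` rows are hidden, including the one that would have been appended after the tombstone. A stricter "only data added before the tombstone" variant would need an ordering attribute (for example a snapshot id) on both rows and tombstones, which this sketch leaves out.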