Re: Iceberg tombstone?

Miao Wang Mon, 27 Jan 2020 15:34:17 -0800

Hi Ryan,

I just found your comment in my junk mailbox.

` It sounds like a tombstone is independent of the table snapshot, like 
blacklisting a row. ` -> It goes into the snapshot. It was first developed for 
an Adobe use case, which is not a row-level delete marker. However, it is 
generic enough to be extended to support row-level deletes.

@Filip Bocse can provide more details on the tombstone design and use case.

Miao

From: Ryan Blue
Date: Tuesday, January 21, 2020 at 5:40 PM
To: Iceberg Dev List
Subject: Re: Iceberg tombstone?

Hi everyone,

Thanks for bringing this up, Filip. I've been thinking about it over the last 
few days. One thing that I'm not clear about is how exactly tombstones differ 
from the row-level deletes that we're already working on. It sounds like a 
tombstone is independent of the table snapshot, like blacklisting a row. So if 
I tombstone user = 'rdblue', it doesn't just apply to all the data currently in 
the table; it also deletes any future additions of that user. Is that correct?

If that's right, then I think that could be added fairly easily using the 
existing filter code and our Table interface to wrap existing readers. On the 
other hand, if that's not what you want then I'm not sure how it differs from 
the existing plan to add row-level deletes. Can you clarify?
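
For illustration only, here is a minimal sketch of the snapshot-independent reading of a tombstone: a predicate wrapped around an existing reader, so that matching rows are dropped no matter when they were written. This is plain Python, not Iceberg code; `read_rows`, `tombstone_filter` and the dict-shaped rows are hypothetical stand-ins and do not correspond to Iceberg's filter code or its Table interface.

```python
from typing import Dict, Iterable, Iterator, Set

Row = Dict[str, object]


def read_rows() -> Iterator[Row]:
    """Hypothetical stand-in for an existing table reader."""
    yield {"user": "rdblue", "event": "login"}
    yield {"user": "filip", "event": "login"}
    yield {"user": "rdblue", "event": "logout"}


def tombstone_filter(rows: Iterable[Row], column: str, values: Set[object]) -> Iterator[Row]:
    """Wrap a reader and drop every row whose column value is tombstoned.

    The filter never looks at when a row was written, so rows added after
    the tombstone was set are hidden as well.
    """
    for row in rows:
        if row.get(column) not in values:
            yield row


# Tombstone user = 'rdblue' on top of the unchanged reader.
visible = list(tombstone_filter(read_rows(), "user", {"rdblue"}))
```

Because the wrapper is lazy, the underlying reader is untouched and the tombstone set can change between scans without rewriting any data.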

rb

On Wed, Jan 15, 2020 at 11:04 AM Miao Wang wrote:
Hi Filip,

I vote for this feature because it can be extended to support [1].

Besides OpenInx's comments, can you provide a high-level description of how 
tombstones align with milestone [1]?

[1] https://github.com/apache/incubator-iceberg/milestone/4

Thanks!

Miao

From: OpenInx
Date: Tuesday, January 14, 2020 at 6:34 PM
Subject: Re: Iceberg tombstone?

Hi Filip

So your team has implemented the tombstone feature in your internal branch. 
As I understand it, the tombstone you mean is similar to the delete marker in 
HBase, so I think you're trying to implement the update/delete feature. For 
this part, Anton and Miguel have a design doc [1]; IMO your work should 
intersect with it.

Another question: one rule that we shouldn't break (my personal view) is the 
open file format rule, which means encoding data in an open format so that it 
can be viewed with non-Iceberg tools. Will your implementation follow that 
rule? Will you encode the tombstones in your own format, or just write them to 
a separate file and mark it as a tombstone file?

I'm not familiar with the column predicate filter or row-level filter part. 
Would you mind providing more details? :-)

Thanks for your information :-)

[1] https://docs.google.com/document/d/1Pk34C3diOfVCRc-sfxfhXZfzvxwum1Odo-6Jj9mwK38/edit#

On Tue, Jan 14, 2020 at 10:20 PM Filip wrote:
Hi everyone,

I was wondering if it would be of interest to start a thread on a proposal for 
a tombstone feature in Iceberg.
Well, maybe it could be less strict than a general-case implementation of 
tombstones [1].
I would like at least a couple of things to be covered in this thread:

  1.  [Opportunity] I was wondering if others on this list would find such a 
feature useful and/or if folks would support having such a feature provided by 
Iceberg.
  2.  [Feasibility] I'm also wondering if there's any intersection between a 
tombstone feature (i.e. filter by column predicate) and the upcoming Upsert 
spec/implementation, or if the two serve different use cases and so shouldn't 
be mixed up, other than indirectly, I guess, through the sheer implications of 
accidental complexity :)

The current Iceberg codebase is quite generous with respect to the Open/Closed 
principle. We've been doing some spikes to implement such a feature in a new 
datasource, and I thought I'd share some touchpoints of our work so far (we 
would gladly share the work itself if the community is interested):
> [extension] implementing tombstoning as a column/values predicate should be 
> associated with some specific metadata (snapshot id, version?) and basic 
> metrics (i.e. count, basic histograms). We're mostly thinking that any 
> tombstone operation is accompanied by a compaction task, so metadata and 
> metrics would help with building generic solutions for optimal scheduling of 
> these maintenance tasks. Tombstones could be modeled/programmed against the 
> org.apache.iceberg.Table interface.
> [atomic guarantee] a simple solution is to make the tombstone operation 
> atomic by assigning a new snapshot summary property that points to an 
> immutable file holding the tombstone predicates/expressions
> [new API] append tombstones
> [new API] remove tombstones
> [new API] append files and add tombstones
> [new API] append files and remove tombstones
> [new API] vacuum tombstones - a task that cleans up tombstoned rows and 
> evicts the associated tombstone metadata as well. Oh, and maybe not `vacuum` 
> (I remember reading about this on this list in a different context and it's 
> probably reserved for a different Iceberg feature, right?)
> [extend] extend the Spark reader/writer to account for tombstone-based 
> filtering and tombstone committing respectively. Writing may prove easier to 
> implement than reading; reading comes in many flavours, so applying a filter 
> expression may not be as accessible at the moment as one would prefer when 
> extending on top of Iceberg [2]
> [extend] removing snapshots would also cause their associated tombstone 
> files to be dropped as well (where available)
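
The [atomic guarantee], [new API] and snapshot-removal points above can be sketched roughly as follows. This is a toy in-memory model in Python, not the internal implementation and not Iceberg's API: `ToyTable`, `Snapshot`, the `tombstone-file` property name and the JSON file layout are all assumptions made for illustration, and tombstones are reduced to a set of values for a single column.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set

TOMBSTONE_PROP = "tombstone-file"  # hypothetical snapshot summary property


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    summary: Dict[str, str]


@dataclass
class ToyTable:
    files: Dict[str, str] = field(default_factory=dict)  # path -> file contents
    snapshots: List[Snapshot] = field(default_factory=list)
    next_snapshot_id: int = 1

    def current_tombstones(self) -> Set[str]:
        """Read the tombstone set referenced by the latest snapshot."""
        if not self.snapshots:
            return set()
        path = self.snapshots[-1].summary.get(TOMBSTONE_PROP)
        return set(json.loads(self.files[path])) if path else set()

    def _commit(self, tombstones: Set[str]) -> Snapshot:
        # Write a brand-new immutable file first, then publish it by adding a
        # snapshot whose summary points at it. Readers see either the old
        # snapshot or the new one, never a half-written tombstone set.
        path = f"tombstones-{uuid.uuid4().hex}.json"
        self.files[path] = json.dumps(sorted(tombstones))
        snapshot = Snapshot(self.next_snapshot_id, {TOMBSTONE_PROP: path})
        self.next_snapshot_id += 1
        self.snapshots.append(snapshot)
        return snapshot

    def append_tombstones(self, values: Iterable[str]) -> Snapshot:
        return self._commit(self.current_tombstones() | set(values))

    def remove_tombstones(self, values: Iterable[str]) -> Snapshot:
        return self._commit(self.current_tombstones() - set(values))

    def expire_snapshot(self, snapshot_id: int) -> None:
        """Drop a snapshot and any tombstone file no live snapshot references."""
        expired = [s for s in self.snapshots if s.snapshot_id == snapshot_id]
        self.snapshots = [s for s in self.snapshots if s.snapshot_id != snapshot_id]
        live = {s.summary.get(TOMBSTONE_PROP) for s in self.snapshots}
        for snapshot in expired:
            path = snapshot.summary.get(TOMBSTONE_PROP)
            if path and path not in live:
                self.files.pop(path, None)


table = ToyTable()
table.append_tombstones(["rdblue", "filip"])
table.remove_tombstones(["filip"])
```

Each operation rewrites the whole tombstone file rather than mutating it, which keeps older snapshots readable until they are expired.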

[1] I believe that in a general tombstone implementation the data filtering is 
applied only to data that was added prior to the tombstone setting/assignment 
operation, but that might prove quite difficult to implement with Iceberg. It 
could be considered a specialization of a more generic and basic use case: 
adding tombstone support as filtering by column values regardless of the order 
of operations (data vs tombstone).
[2] We could benefit from adding an extension point/hook into row-level 
filtering that we could leverage to translate tombstone options into row-level 
filters/predicates in 
https://github.com/apache/incubator-iceberg/blob/6048e5a794242cb83871e3838d0d40aa71e36a91/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L438-L439
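
The distinction drawn in [1] can be sketched as below, assuming every write and every tombstone operation carries a commit-order number. The `seq` field, the `Row` and `Tombstone` types and `visible_rows` are hypothetical and exist only for this example; they are not part of Iceberg.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Row:
    seq: int   # hypothetical commit order of the write that added the row
    user: str


@dataclass(frozen=True)
class Tombstone:
    seq: int   # hypothetical commit order of the tombstone operation
    user: str


def visible_rows(rows: List[Row], tombstones: List[Tombstone], order_dependent: bool) -> List[Row]:
    """Return the rows a reader would see under either tombstone semantics."""
    result = []
    for row in rows:
        hidden = any(
            t.user == row.user and (not order_dependent or row.seq < t.seq)
            for t in tombstones
        )
        if not hidden:
            result.append(row)
    return result


rows = [Row(1, "rdblue"), Row(2, "filip"), Row(4, "rdblue")]
tombstones = [Tombstone(3, "rdblue")]

# General case: only data added before the tombstone is filtered out.
general = visible_rows(rows, tombstones, order_dependent=True)
# Basic case: filter by column value regardless of order of operations.
basic = visible_rows(rows, tombstones, order_dependent=False)
```

In the general case the row written at `seq` 4 survives because it was added after the tombstone; in the basic case it is hidden along with the earlier one.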

/Filip

--
Ryan Blue
Software Engineer
Netflix
