Hi, I agree with Peter here, and I would say that it would be an issue for multi-engine support.
As I already mentioned to others, I think we should explore an alternative. Since the main issue is the data file scan in the streaming context, maybe we could find a way to "index"/correlate positional deletes with limited scanning. I will think again about that :) Regards JB On Sat, Nov 9, 2024 at 6:48 AM Péter Váry <peter.vary.apa...@gmail.com> wrote: > Hi Imran, > > I don't think it's a good idea to start creating multiple types of Iceberg > tables. Iceberg's main selling point is compatibility between engines. If > we don't have readers and writers for all types of tables, then we remove > compatibility from the equation and engine-specific formats always win. > OTOH, if we write readers and writers for all types of tables then we are > back on square one. > > Identifier fields are a table schema concept and are used in many cases during > query planning and execution. This is why they are defined as part of the > SQL spec, and this is why Iceberg defines them as well. One use case is > where they can be used to merge deletes (independently of how they are > manifested) and subsequent inserts into updates. > > Flink SQL doesn't allow creating tables with partition transforms, so no > new table could be created by Flink SQL using transforms, but tables > created by other engines could still be used (both read and write). Also you > can create such tables in Flink using the Java API. > > Requiring partition columns to be part of the identifier fields comes > from the practical consideration that you want to limit the scope of the > equality deletes as much as possible. Otherwise all of the equality deletes > would be table-global, and they would have to be read by every reader. We could > write those, we just decided that we don't want to allow the user to do > this, as it is in most cases a bad idea. 
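Peter's point above — that identifier fields let a reader merge a delete and a subsequent insert on the same key into an update — can be sketched in a few lines. This is a hypothetical illustration; `collapse_changelog` and the event shape are invented for this sketch, not Iceberg's changelog API:

```python
# Hypothetical sketch: a changelog view can pair a DELETE and a later
# INSERT that share the same identifier-field value and report them as
# a single UPDATE. Not Iceberg's actual implementation.

def collapse_changelog(events, id_field):
    """events: iterable of (op, row) tuples, op is "INSERT" or "DELETE"."""
    pending_deletes = {}  # identifier value -> deleted row
    collapsed = []
    for op, row in events:
        key = row[id_field]
        if op == "DELETE":
            pending_deletes[key] = row
        elif key in pending_deletes:
            # delete + insert on the same identifier => update
            collapsed.append(("UPDATE", pending_deletes.pop(key), row))
        else:
            collapsed.append(("INSERT", None, row))
    # deletes that never saw a matching insert stay deletes
    collapsed.extend(("DELETE", row, None) for row in pending_deletes.values())
    return collapsed
```

The same identifier values are what scope equality deletes, which is why the two concepts keep appearing together in this thread.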
> > I hope this helps, > Peter > > On Fri, Nov 8, 2024, 22:01 Imran Rashid <iras...@cloudera.com.invalid> > wrote: > >> I'm not down in the weeds at all myself on implementation details, so >> forgive me if I'm wrong about the details here. >> >> I can see all the viewpoints -- both that equality deletes enable some >> use cases and that they make others far more difficult. What surprised me the >> most is that Iceberg does not provide a way to distinguish these two table >> "types". >> >> At first, I thought the presence of an identifier-field ( >> https://iceberg.apache.org/spec/#identifier-field-ids) indicated that >> the table was a target for equality deletes. But then it turns out >> identifier-fields are also useful for changelog views even without equality >> deletes -- IIUC, they show that a delete + insert should actually be >> interpreted as an update in a changelog view. >> >> To be perfectly honest, I'm confused about all of these details -- from >> my read, the spec does not indicate this relationship between >> identifier-fields and equality_ids in equality delete files ( >> https://iceberg.apache.org/spec/#equality-delete-files), but I think >> that is the way Flink works. Flink itself seems to have even more >> limitations -- no partition transforms are allowed, and all partition >> columns must be a subset of the identifier fields. Is that just a Flink >> limitation, or is that the intended behavior in the spec? (Or maybe >> user-error on my part?) Those seem like very reasonable limitations, from >> an implementation point-of-view. But OTOH, as a user, this seems to be >> directly contrary to some of the promises of Iceberg. >> >> It's easy to see if a table already has equality deletes in it, by looking >> at the metadata. But is there any way to indicate that a table (or branch >> of a table) _must not_ have equality deletes added to it? >> >> If that were possible, it seems like we could support both use cases. 
We >> could continue to optimize for the streaming ingestion use cases using >> equality deletes. But we could also build more optimizations into the >> "non-streaming-ingestion" branches. And we could document the tradeoff so >> it is much clearer to end users. >> >> To maintain compatibility, I suppose that the change would be that >> equality deletes continue to be allowed by default, but we'd add a new >> field to indicate that for some tables (or branches of a table), equality >> deletes would not be allowed. And it would be an error for an engine to >> make an update which added an equality delete to such a table. >> >> Maybe that change would even be possible in V3. >> >> And if all the performance improvements to equality deletes make this a >> moot point, we could drop the field in v4. But it seems like a mistake to >> both limit the non-streaming use-case AND have confusing limitations for >> the end-user in the meantime. >> >> I would happily be corrected about my understanding of all of the above. >> >> thanks! >> Imran >> >> On Tue, Nov 5, 2024 at 9:16 AM Bryan Keller <brya...@gmail.com> wrote: >> >>> I also feel we should keep equality deletes until we have an alternative >>> solution for streaming updates/deletes. >>> >>> -Bryan >>> >>> On Nov 4, 2024, at 8:33 AM, Péter Váry <peter.vary.apa...@gmail.com> >>> wrote: >>> >>> Well, it seems like I'm a little late, so most of the arguments are >>> voiced. >>> >>> I agree that we should not deprecate the equality deletes until we have >>> a replacement feature. >>> I think one of the big advantages of Iceberg is that it supports batch >>> processing and streaming ingestion too. >>> For streaming ingestion we need a way to update existing data in a >>> performant way, but restricting deletes for the primary keys seems like >>> enough from the streaming perspective. >>> >>> Equality deletes allow a very wide range of applications, which we might >>> be able to narrow down a bit, but still keep useful. 
So if we want to go >>> down this road, we need to start collecting the requirements. >>> >>> Thanks, >>> Peter >>> >>> Shani Elharrar <sh...@upsolver.com.invalid> wrote (on Fri, Nov >>> 1, 2024, 19:22): >>> >>>> I understand how it makes sense for batch jobs, but it damages streaming >>>> jobs; using equality deletes works much better for streaming (which has a >>>> strict SLA on delays), and in order to decrease the performance penalty, >>>> systems can rewrite the equality deletes to positional deletes. >>>> >>>> Shani. >>>> >>>> On 1 Nov 2024, at 20:06, Steven Wu <stevenz...@gmail.com> wrote: >>>> >>>> >>>> Fundamentally, it is very difficult to write position deletes with >>>> concurrent writers and conflicts, for batch jobs too, as the inverted index >>>> may become invalid/stale. >>>> >>>> The position deletes are created during the write phase. But conflicts >>>> are only detected at the commit stage. I assume the batch job should fail >>>> in this case. >>>> >>>> On Fri, Nov 1, 2024 at 10:57 AM Steven Wu <stevenz...@gmail.com> wrote: >>>> >>>>> Shani, >>>>> >>>>> That is a good point. It is certainly a limitation for the Flink job >>>>> to track the inverted index internally (which is what I had in mind). It >>>>> can't be shared/synchronized with other Flink jobs or other engines >>>>> writing >>>>> to the same table. >>>>> >>>>> Thanks, >>>>> Steven >>>>> >>>>> On Fri, Nov 1, 2024 at 10:50 AM Shani Elharrar >>>>> <sh...@upsolver.com.invalid> wrote: >>>>> >>>>>> Even if Flink can create this state, it would have to be maintained >>>>>> against the Iceberg table; we wouldn't want duplicates (keys) if other >>>>>> systems / users update the table (e.g. manual inserts / updates using DML). >>>>>> >>>>>> Shani. >>>>>> >>>>>> On 1 Nov 2024, at 18:32, Steven Wu <stevenz...@gmail.com> wrote: >>>>>> >>>>>> >>>>>> > Add support for inverted indexes to reduce the cost of position >>>>>> lookup. 
This is fairly tricky to implement for streaming use cases >>>>>> without >>>>>> an external system. >>>>>> >>>>>> Anton, that is also what I was saying earlier. In Flink, the inverted >>>>>> index of (key, committed data files) can be tracked in Flink state. >>>>>> >>>>>> On Fri, Nov 1, 2024 at 2:16 AM Anton Okolnychyi < >>>>>> aokolnyc...@gmail.com> wrote: >>>>>> >>>>>>> I was a bit skeptical when we were adding equality deletes, but >>>>>>> nothing beats their performance during writes. We have to find an >>>>>>> alternative before deprecating. >>>>>>> >>>>>>> We are doing a lot of work to improve streaming, like reducing the >>>>>>> cost of commits, enabling a large (potentially infinite) number of >>>>>>> snapshots, changelog reads, and so on. It is a project goal to excel in >>>>>>> streaming. >>>>>>> >>>>>>> I was going to focus on equality deletes after completing the DV >>>>>>> work. I believe we have these options: >>>>>>> >>>>>>> - Revisit the existing design of equality deletes (e.g. add more >>>>>>> restrictions, improve compaction, offer new writers). >>>>>>> - Standardize on the view-based approach [1] to handle streaming >>>>>>> upserts and CDC use cases, potentially making this part of the spec. >>>>>>> - Add support for inverted indexes to reduce the cost of position >>>>>>> lookup. This is fairly tricky to implement for streaming use cases >>>>>>> without >>>>>>> an external system. Our runtime filtering in Spark today is equivalent >>>>>>> to >>>>>>> looking up positions in an inverted index represented by another Iceberg >>>>>>> table. That may still not be enough for some streaming use cases. >>>>>>> >>>>>>> [1] - https://www.tabular.io/blog/hello-world-of-cdc/ >>>>>>> >>>>>>> - Anton >>>>>>> >>>>>>> On Thu, Oct 31, 2024 at 21:31 Micah Kornfield <emkornfi...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> I agree that equality deletes have their place in streaming. 
I >>>>>>>> think the ultimate decision here is how opinionated Iceberg wants to >>>>>>>> be on >>>>>>>> its use-cases. If it really wants to stick to its origins of "slow >>>>>>>> moving >>>>>>>> data", then removing equality deletes would be in line with this. I >>>>>>>> think >>>>>>>> the other high-level question is how much we allow for partially >>>>>>>> compatible >>>>>>>> features (the row lineage use-case feature was explicitly approved >>>>>>>> excluding equality deletes, and people seemed OK with it at the time. >>>>>>>> If >>>>>>>> all features need to work together, then maybe we need to rethink the >>>>>>>> design here so it can be forward-compatible with equality deletes). >>>>>>>> >>>>>>>> I think one issue with equality deletes as stated in the spec is >>>>>>>> that they are overly broad. I'd be interested if people have any use >>>>>>>> cases >>>>>>>> that differ, but I think one way of narrowing the specification scope >>>>>>>> on >>>>>>>> equality deletes (and probably a necessary building block for >>>>>>>> building something better) is to focus on upsert/streaming deletes. Two >>>>>>>> proposals in >>>>>>>> this regard are: >>>>>>>> >>>>>>>> 1. Require that equality deletes can only correspond to unique >>>>>>>> identifiers for the table. >>>>>>>> 2. Consider requiring that, for equality deletes on partitioned >>>>>>>> tables, the primary key must contain a partition column (I believe >>>>>>>> Flink at least already does this). It is less clear to me that this >>>>>>>> would >>>>>>>> meet all existing use-cases. But having this would allow for better >>>>>>>> incremental data-structures, which could then be partition-based. >>>>>>>> >>>>>>>> Narrowing the scope to unique identifiers would allow for further building >>>>>>>> blocks already mentioned, like a secondary index (possible via LSM >>>>>>>> tree), >>>>>>>> that would allow for better performance overall. 
>>>>>>>> >>>>>>>> I generally agree with the sentiment that we shouldn't deprecate >>>>>>>> them until there is a viable replacement. With all due respect to my >>>>>>>> employer, let's not fall into the Google trap [1] :) >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Micah >>>>>>>> >>>>>>>> [1] https://goomics.net/50/ >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Oct 31, 2024 at 12:35 PM Alexander Jo < >>>>>>>> alex...@starburstdata.com> wrote: >>>>>>>> >>>>>>>>> Hey all, >>>>>>>>> >>>>>>>>> Just to throw my 2 cents in, I agree with Steven and others that >>>>>>>>> we do need some kind of replacement before deprecating equality >>>>>>>>> deletes. >>>>>>>>> They certainly have their problems, and do significantly increase >>>>>>>>> complexity as they are now, but the writing of position deletes is too >>>>>>>>> expensive for certain pipelines. >>>>>>>>> >>>>>>>>> We've been investigating using equality deletes for some of our >>>>>>>>> workloads at Starburst; the key advantage we were hoping to leverage >>>>>>>>> is >>>>>>>>> cheap, effectively random-access lookup deletes. >>>>>>>>> Say you have a UUID column that's unique in a table and want to >>>>>>>>> delete a row by UUID. With position deletes each delete is expensive >>>>>>>>> without an index on that UUID. >>>>>>>>> With equality deletes each delete is cheap while >>>>>>>>> reads/compaction are expensive, but when updates are frequent and reads >>>>>>>>> are >>>>>>>>> sporadic that's a reasonable tradeoff. >>>>>>>>> >>>>>>>>> Pretty much what Jason and Steven have already said. >>>>>>>>> >>>>>>>>> Maybe there are some incremental improvements on equality deletes >>>>>>>>> or tips from similar systems that might alleviate some of their >>>>>>>>> problems? 
>>>>>>>>> >>>>>>>>> - Alex Jo >>>>>>>>> >>>>>>>>> On Thu, Oct 31, 2024 at 10:58 AM Steven Wu <stevenz...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> We probably all agree on the downside of equality deletes: it >>>>>>>>>> postpones all the work to the read path. >>>>>>>>>> >>>>>>>>>> In theory, we can implement position deletes only in the Flink >>>>>>>>>> streaming writer. It would require tracking the last committed >>>>>>>>>> data >>>>>>>>>> files per key, which can be stored in Flink state (checkpointed). >>>>>>>>>> This is >>>>>>>>>> obviously quite expensive/challenging, but possible. >>>>>>>>>> >>>>>>>>>> I'd like to echo one benefit of equality deletes that Russell called >>>>>>>>>> out in the original email. Equality deletes would never have >>>>>>>>>> conflicts. >>>>>>>>>> That is important for streaming writers (Flink, Kafka connect, ...) >>>>>>>>>> that >>>>>>>>>> commit frequently (minutes or less). Assume Flink can write position >>>>>>>>>> deletes only and commit every 2 minutes. The long-running nature of >>>>>>>>>> streaming jobs can cause frequent commit conflicts with background >>>>>>>>>> delete >>>>>>>>>> compaction jobs. >>>>>>>>>> >>>>>>>>>> Overall, the streaming upsert write is not a well-solved problem >>>>>>>>>> in Iceberg. This probably affects all streaming engines (Flink, Kafka >>>>>>>>>> connect, Spark streaming, ...). We need to come up with some better >>>>>>>>>> alternatives before we can deprecate equality deletes. 
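The per-key state Steven describes — remembering where each key's latest row was committed so a streaming writer could emit position deletes instead of equality deletes — might look roughly like this. Names here are invented for illustration; in Flink the index would live in checkpointed operator state rather than a plain dict:

```python
# Hypothetical sketch of a per-key index for a streaming upsert writer:
# track where each key's latest row was committed so the old row can be
# deleted by exact position rather than by an equality predicate.

class PositionTrackingWriter:
    def __init__(self):
        self.index = {}             # key -> (data_file, row_position)
        self.position_deletes = []  # (data_file, row_position) to delete

    def upsert(self, key, data_file, row_position):
        previous = self.index.get(key)
        if previous is not None:
            # key seen before: delete the previous row at its known position
            self.position_deletes.append(previous)
        self.index[key] = (data_file, row_position)
```

This also makes the limitations from the thread visible: the index grows with the key space, and because it is private to one job it cannot see rows written by other engines or jobs, which is exactly Shani's objection.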
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Oct 31, 2024 at 8:38 AM Russell Spitzer < >>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> For users of Equality Deletes, what are the key benefits to >>>>>>>>>>> Equality Deletes that you would like to preserve and could you >>>>>>>>>>> please share >>>>>>>>>>> some concrete examples of the queries you want to run (and the >>>>>>>>>>> schemas and >>>>>>>>>>> data sizes you would like to run them against) and the latencies >>>>>>>>>>> that would >>>>>>>>>>> be acceptable? >>>>>>>>>>> >>>>>>>>>>> On Thu, Oct 31, 2024 at 10:05 AM Jason Fine >>>>>>>>>>> <ja...@upsolver.com.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Representing Upsolver here, we also make use of Equality >>>>>>>>>>>> Deletes to deliver high frequency low latency updates to our >>>>>>>>>>>> clients at >>>>>>>>>>>> scale. We have customers using them at scale and demonstrating the >>>>>>>>>>>> need and >>>>>>>>>>>> viability. We automate the process of converting them into >>>>>>>>>>>> positional >>>>>>>>>>>> deletes (or fully applying them) for more efficient engine queries >>>>>>>>>>>> in the >>>>>>>>>>>> background giving our users both low latency and good query >>>>>>>>>>>> performance. >>>>>>>>>>>> >>>>>>>>>>>> Equality Deletes were added since there isn't a good way to >>>>>>>>>>>> solve frequent updates otherwise. It would require some sort of >>>>>>>>>>>> index >>>>>>>>>>>> keeping track of every record in the table (by a predetermined PK) >>>>>>>>>>>> and >>>>>>>>>>>> maintaining such an index is a huge task that every tool >>>>>>>>>>>> interested in this >>>>>>>>>>>> would need to re-implement. It also becomes a bottleneck limiting >>>>>>>>>>>> table >>>>>>>>>>>> sizes. >>>>>>>>>>>> >>>>>>>>>>>> I don't think they should be removed without providing an >>>>>>>>>>>> alternative. 
Positional Deletes have a different performance >>>>>>>>>>>> profile >>>>>>>>>>>> inherently, requiring more upfront work proportional to the table >>>>>>>>>>>> size. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Oct 31, 2024 at 2:45 PM Jean-Baptiste Onofré < >>>>>>>>>>>> j...@nanthrax.net> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Russell >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks for the nice writeup and the proposal. >>>>>>>>>>>>> >>>>>>>>>>>>> I agree with your analysis, and I have the same feeling. >>>>>>>>>>>>> However, I >>>>>>>>>>>>> think there are engines other than Flink that write equality delete >>>>>>>>>>>>> files. So, >>>>>>>>>>>>> I agree to deprecate in V3, but maybe be more "flexible" about >>>>>>>>>>>>> removal >>>>>>>>>>>>> in V4 in order to give engines time to update. >>>>>>>>>>>>> I think that by deprecating equality deletes, we are clearly >>>>>>>>>>>>> focusing >>>>>>>>>>>>> on read performance and "consistency" (more than write). It's >>>>>>>>>>>>> not >>>>>>>>>>>>> necessarily a bad thing but the streaming and data >>>>>>>>>>>>> ingestion >>>>>>>>>>>>> platforms will probably be concerned about that (by using >>>>>>>>>>>>> positional >>>>>>>>>>>>> deletes, they will have to scan/read all data files to find the >>>>>>>>>>>>> positions, which is painful). >>>>>>>>>>>>> >>>>>>>>>>>>> So, to summarize: >>>>>>>>>>>>> 1. Agree to deprecate equality deletes, but -1 on committing to any >>>>>>>>>>>>> target >>>>>>>>>>>>> for removal before having a clear path for streaming platforms >>>>>>>>>>>>> (Flink, Beam, ...) >>>>>>>>>>>>> 2. In the meantime (during the deprecation period), I propose >>>>>>>>>>>>> to >>>>>>>>>>>>> explore possible improvements for streaming platforms (maybe >>>>>>>>>>>>> finding a >>>>>>>>>>>>> way to avoid full data file scans, ...) >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks ! 
>>>>>>>>>>>>> Regards >>>>>>>>>>>>> JB >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Oct 30, 2024 at 10:06 PM Russell Spitzer >>>>>>>>>>>>> <russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> > >>>>>>>>>>>>> > Background: >>>>>>>>>>>>> > >>>>>>>>>>>>> > 1) Position Deletes >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > Writers determine which rows are deleted and mark them in a 1-for-1 >>>>>>>>>>>>> representation. With delete vectors this means every data >>>>>>>>>>>>> file has at >>>>>>>>>>>>> most 1 delete vector that it is read in conjunction with to >>>>>>>>>>>>> excise deleted >>>>>>>>>>>>> rows. Reader overhead is more or less constant and is very >>>>>>>>>>>>> predictable. >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > The main cost of this mode is that deletes must be >>>>>>>>>>>>> determined at write time, which is expensive and can be more >>>>>>>>>>>>> difficult for >>>>>>>>>>>>> conflict resolution. >>>>>>>>>>>>> > >>>>>>>>>>>>> > 2) Equality Deletes >>>>>>>>>>>>> > >>>>>>>>>>>>> > Writers write out references to the values that are deleted (in a >>>>>>>>>>>>> partition or globally). There can be an unlimited number of >>>>>>>>>>>>> equality >>>>>>>>>>>>> deletes and they all must be checked for every data file that is >>>>>>>>>>>>> read. The >>>>>>>>>>>>> cost of determining deleted rows is essentially given to the >>>>>>>>>>>>> reader. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Conflicts almost never happen since data files are not >>>>>>>>>>>>> actually changed and there is almost no cost to the writer to >>>>>>>>>>>>> generate >>>>>>>>>>>>> these. Almost all costs related to equality deletes are passed on >>>>>>>>>>>>> to the >>>>>>>>>>>>> reader. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Proposal: >>>>>>>>>>>>> > >>>>>>>>>>>>> > Equality deletes are, in my opinion, unsustainable and we >>>>>>>>>>>>> should work on deprecating and removing them from the >>>>>>>>>>>>> specification. 
At >>>>>>>>>>>>> this time, I know of only one engine (Apache Flink) which >>>>>>>>>>>>> produces these >>>>>>>>>>>>> deletes but almost all engines have implementations to read them. >>>>>>>>>>>>> The cost >>>>>>>>>>>>> of implementing equality deletes on the read path is difficult and >>>>>>>>>>>>> unpredictable in terms of memory usage and compute complexity. >>>>>>>>>>>>> We’ve had >>>>>>>>>>>>> suggestions of implementing RocksDB in order to handle ever-growing >>>>>>>>>>>>> sets of >>>>>>>>>>>>> equality deletes, which in my opinion shows that we are going down >>>>>>>>>>>>> the wrong >>>>>>>>>>>>> path. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Outside of performance, Equality deletes are also difficult >>>>>>>>>>>>> to use in conjunction with many other features. For example, any >>>>>>>>>>>>> features >>>>>>>>>>>>> requiring CDC or Row lineage are basically impossible when >>>>>>>>>>>>> equality deletes >>>>>>>>>>>>> are in use. When Equality deletes are present, the state of the >>>>>>>>>>>>> table can >>>>>>>>>>>>> only be determined with a full scan, making it difficult to update >>>>>>>>>>>>> differential structures. This means materialized views or indexes >>>>>>>>>>>>> need to >>>>>>>>>>>>> essentially be fully rebuilt whenever an equality delete is added >>>>>>>>>>>>> to the >>>>>>>>>>>>> table. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Equality deletes essentially remove complexity from the >>>>>>>>>>>>> write side but then add what I believe is an unacceptable level of >>>>>>>>>>>>> complexity to the read side. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Because of this I suggest we deprecate Equality Deletes in >>>>>>>>>>>>> V3 and slate them for full removal from the Iceberg Spec in V4. >>>>>>>>>>>>> > >>>>>>>>>>>>> > I know this is a big change and compatibility breakage so I >>>>>>>>>>>>> would like to introduce this idea to the community and solicit >>>>>>>>>>>>> feedback >>>>>>>>>>>>> from all stakeholders. 
I am very flexible on this issue and would >>>>>>>>>>>>> like to >>>>>>>>>>>>> hear the best issues both for and against removal of Equality >>>>>>>>>>>>> Deletes. >>>>>>>>>>>>> > >>>>>>>>>>>>> > Thanks everyone for your time, >>>>>>>>>>>>> > >>>>>>>>>>>>> > Russ Spitzer >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> *Jason Fine* >>>>>>>>>>>> Chief Software Architect >>>>>>>>>>>> ja...@upsolver.com | www.upsolver.com >>>>>>>>>>>> >>>>>>>>>>> >>>
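The read-path asymmetry described in Russell's background sections can be made concrete with a deliberately simplified sketch (illustrative only, not Iceberg's reader; both function names are invented): position deletes apply as a per-file set of row ordinals, while every accumulated equality delete must be checked against every row of every data file read.

```python
# Simplified sketch of the two read paths (assumed model, not Iceberg's
# implementation).

def read_with_position_deletes(rows, deleted_positions):
    # one O(1) membership test per row against this file's delete vector
    return [r for i, r in enumerate(rows) if i not in deleted_positions]

def read_with_equality_deletes(rows, equality_deletes):
    # every accumulated equality delete tuple is compared with every row,
    # so read cost grows with the number of outstanding equality deletes
    def live(row):
        return not any(all(row.get(col) == val for col, val in d.items())
                       for d in equality_deletes)
    return [r for r in rows if live(r)]
```

The first path's cost is bounded by the file being read; the second grows with the set of outstanding equality deletes, which is exactly the unpredictable memory and compute behavior the proposal calls out.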