Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Gyula Fóra Fri, 09 May 2025 03:16:46 -0700

Hi Anton,

Thank you for summarizing the options we see at this stage in a structured
and concise way.


Based on the use-cases I see in the industry, I feel that not all of the
highlighted options are feasible (or desirable).


Option 4 would basically remove any possibilities for native streaming CDC
reads on tables, severely limiting how Iceberg can be used in the future
for real-time use-cases from Flink and other similar engines that may want
to connect. I understand that the view reconciliation approach is
implemented in Spark SQL already but implementing it in a proper streaming
way would likely lead to similar problems that we are trying to solve here
in the first place.


Regarding Option 1, introducing engine specific/custom indexing solutions
would go against the core design principles as it would be hard to mix
different engines when writing / reading tables for CDC use-cases. (A
streaming job would have a hard time writing upserts/equality deletes to
tables that were written by a different engine). To me this sounds very
similar to Option 4 in a way that it pushes too much logic to the engines
in a way that would hurt the compatibility across the engines.


Based on this and the context of the discussion, Option 2 or a combination
of Option 2 & 3 sounds the most reasonable to me. There are still a lot of
questions on the practical implementation of the indexes and how we can do
this efficiently so this is only a very early feedback from my end.


Cheers,

Gyula



On Fri, May 9, 2025 at 12:14 AM Anton Okolnychyi <aokolnyc...@gmail.com>
wrote:

> I am glad to see that folks are thinking about this problem. I am looking
> forward to a formal proposal/design doc to discuss details!
>
> Overall, this aligns with what we discussed in the
> community earlier w.r.t. the future of equality deletes and streaming
> upserts. If I were to summarize, we have the following options:
>
> Option 1: Add an inverted index (potentially distributed) maintained by an
> engine that does streaming writes to always produce DVs, even in streaming
> use cases. Deprecate/remove equality deletes from Iceberg.
> Option 2: Add native indexing to Iceberg so that the lookup of
> positions is quick and efficient enough to be used in streaming upserts.
> Deprecate/remove equality deletes from Iceberg.
> Option 3: Rethink equality deletes, potentially by introducing more
> restrictions and trying to scope them to particular data files, similar to
> DVs.
> Option 4: Standardize on a view reconciliation approach that Tabular
> implemented for CDC.
>
> I would like to highlight that what Spark does today during MERGE is
> similar to a lookup in an inverted index represented by another Iceberg
> table. That is OK for batch jobs but not enough for streaming.
>
> - Anton
>
> чт, 8 трав. 2025 р. о 10:08 Xiaoxuan Li <xiaoxuan.li....@gmail.com> пише:
>
>> Hi Zheng, Steven, Amogh and Gyula. Thank you all for the feedback!
>>
>> I agree with everyone, we need to narrow down the scope of this
>> optimization. The primary issue I'm trying to address is the slow read
>> performance caused by the growing number of equality delete files(streaming
>> CDC scenarios). The other potential use cases are only mentioned to show
>> the extensibility of this approach.
>>
>> And both equality and positional deletes suffer from the same core
>> problem, records are evaluated repeatedly against multiple delete
>> predicates, at read time for equality deletes, and at write time for
>> positional deletes. This repeated evaluation is where the real bottleneck
>> lies.
>>
>> It’s particularly bad for equality deletes, since we constantly recompute
>> row positions during reads without ever materializing them. Eventually, a
>> maintenance job is required just to rewrite them into positional deletes.
>>
>> The idea is to eliminate this overhead by introducing an inverted index
>> that maps values (or value combinations) directly to row positions. This
>> lets us skip full predicate evaluation and jump straight to the affected
>> rows, similar to how Hudi uses a record-level index for fast upserts. If we
>> can fetch row positions from the index, we can greatly reduce overhead
>> during reads (for equality deletes) and writes (for positional deletes).
>>
>> In fact, if we can make positional deletes perform as well as equality
>> deletes using this index, we might be able to get rid of equality deletes,
>> but that needs to be evaluated for the upsert case.
>>
>> That’s part of the reason why other types of indexes came up, they’re
>> applicable beyond just primary key columns, but CDC is the only scenario
>> with stricter SLA requirements. So it makes sense to align on the exact
>> problem and use cases. For now, I think we can define our main goal as
>> supporting inverted indexing over primary key columns to address the
>> slowness of reading caused by the growing number of equality delete files.
>>
>> Thanks, Amogh, for bringing up the comparison with known alternatives. We
>> should include benchmarks for those as well, to illustrate the trade-offs
>> in read/write performance, storage usage, and overall cost.
>>
>> Really appreciate your feedback! I’ll incorporate these into the next
>> revision of the proposal.
>>
>>
>> Thanks,
>>
>> Xiaoxuan
>>
>> On Thu, May 8, 2025 at 1:25 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Thank you for the proposal!
>>>
>>> I agree with what had been said above that we need to narrow down the
>>> scope here and what is the primary target for the optimization.
>>>
>>> As Amogh has also pointed out, CDC (streaming) read performance (with
>>> equality deletes) would be one of the biggest beneficiaries of this at a
>>> first glance.
>>> This is especially important for Flink users where this feature is
>>> currently completely missing and there is a big demand for it as we rely on
>>> equality deletes on the write path. [1]
>>>
>>> I am not aware of alternative proposals that would solve the equality
>>> delete cdc read performance question, overall I think using indices is
>>> reasonable and a very promising approach.
>>>
>>> Looking forward to more details and discussion!
>>> Gyula
>>>
>>> [1] https://lists.apache.org/thread/njmxjmjjm341fp4mgynn483v15mhk7qd
>>>
>>>
>>> On Thu, May 8, 2025 at 9:24 AM Amogh Jahagirdar <2am...@gmail.com>
>>> wrote:
>>>
>>>> Thank you for the proposal Xiaoxuan! I think I agree with Zheng and
>>>> Steven's point that it'll probably be more helpful to start out with more
>>>> specific "what" and "why" (known areas of improvement for Iceberg and
>>>> driven by any use cases) before we get too deep into the "how".
>>>>
>>>> In my mind, the specific known area of improvement for Iceberg related
>>>> to this proposal is improving streaming upsert behavior. One area this
>>>> improvement is beneficial for is being able to provide better data
>>>> freshness for Iceberg CDC mirror tables without the heavy read +
>>>> maintenance cost that currently exist with Flink upserts.
>>>>
>>>> As you mentioned, equality deletes have the benefit of being very cheap
>>>> to write but can come at a high and unpredictable cost at read time.
>>>> Challenges with equality deletes have been discussed in the past [1].
>>>> I'll also add that if one of the goals is to improving streaming
>>>> upserts (e.g. for applying CDC change streams into Iceberg mirror tables),
>>>> then there are alternatives that I think we should compare against to make
>>>> the tradeoffs clear. These alternatives include leveraging the known
>>>> changelog view or merge patterns [2] or improving the existing maintenance
>>>> procedures.
>>>>
>>>> I think the potential for being able to use a inverted index for upsert
>>>> cases to more directly identify positions in a file to directly write DVs
>>>> is very exciting, but before getting too far into the weeds, I think it'd
>>>> first be helpful
>>>> to make sure we agree on the specific problem we're trying to solve
>>>> when we talk about performance improvements along with any use cases,
>>>> followed by comparison with known alternatives (ideally we can get numbers
>>>> that demonstrate the read/write/storage/cost tradeoffs for the proposed
>>>> inverted index).
>>>>
>>>> [1]https://lists.apache.org/thread/z0gvco6hn2bpgngvk4h6xqrnw8b32sw6
>>>> [2]https://www.tabular.io/blog/hello-world-of-cdc/
>>>>
>>>> Thanks,
>>>> Amogh Jahagirdar
>>>>
>>>

Re: [DISCUSS] Pre-Proposal: Improving Merge-On-Read Query Performance With Indexing

Reply via email to