Re: Re: Dedicated sync for Iceberg Index Support

huaxin gao Fri, 26 Jun 2026 10:41:14 -0700

Hi all, here is the summary of this Monday's index meeting:

Two blocking decisions closed:


   1. Index is *not* a table: it's its own object (reuses table machinery
   under the hood). Reasoning: it requires a sort order, allows no column
   updates, has no partition spec, allows no overlapping ranges between
   leaves, inherits base-table permissions, and has its own
   CREATE/DROP/UPDATE INDEX DDL.
   2. Index is a separate catalog entity, with no pointers in table
   metadata: it has its own REST endpoints, and the catalog can optionally
   return index metadata with loadTable to avoid extra round-trips. This
   keeps table and index updates independent/async.


Next steps: Start writing the spec and build out the copy-on-write path now.

Here are the draft specs:
secondary index spec <https://github.com/apache/iceberg/pull/16961>
IRC spec <https://github.com/apache/iceberg/pull/16963>

Thanks,
Huaxin

On Sat, Jun 20, 2026 at 11:11 AM huaxin gao <[email protected]> wrote:

> Hi all,
>
> I built a standalone PoC to validate that the basic index structure works:
> that we can build a PK index, convert equality deletes to position deletes
> through it, and have every converted delete land on the correct live row. I
> ran it up to *100M keys*.
>
> *Headline: the structure works.* The index builds over up to 100M keys,
> the eq-delete → position-delete conversion resolved correctly at *every* size
> (100% of converted deletes mapped to the right live row), and the resulting
> position deletes are *~8× cheaper to apply* at query time than the
> equality deletes they replace.
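
To make the conversion step concrete, here is a minimal sketch of what the PoC is described as validating: an index that maps each key to a (file_path, row_position) pair, and a pass that turns key-based deletes into per-file position deletes. The dict-based index, the function names, and the sample file names are illustrative assumptions, not the PoC's code or its storage layout.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the index: key -> (file_path, row_position).
Location = Tuple[str, int]


def build_index(files: Dict[str, List[int]]) -> Dict[int, Location]:
    """Map every key to the file and row position that holds it."""
    index: Dict[int, Location] = {}
    for path, keys in files.items():
        for pos, key in enumerate(keys):
            index[key] = (path, pos)
    return index


def convert_eq_deletes(index: Dict[int, Location],
                       deleted_keys: List[int]) -> Dict[str, List[int]]:
    """Turn key-based (equality) deletes into per-file position deletes."""
    position_deletes: Dict[str, List[int]] = {}
    for key in deleted_keys:
        loc = index.pop(key, None)  # a key that is not indexed has no live row
        if loc is None:
            continue
        path, pos = loc
        position_deletes.setdefault(path, []).append(pos)
    return {path: sorted(rows) for path, rows in position_deletes.items()}


# Made-up sample data: two data files holding five keys in total.
files = {"data-a.parquet": [10, 11, 12], "data-b.parquet": [20, 21]}
index = build_index(files)
deletes = convert_eq_deletes(index, [11, 21, 99])
print(deletes)
```

Key 99 is not in the index, so it produces no position delete. Removing each converted key from the index is one possible policy, chosen here only to keep the sketch small.
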
>
> Beyond correctness, the run also shows how the index’s *maintenance* cost
> scales, comparing copy-on-write (COW, rewrite touched leaves) vs an
> append/merge (MOR) option, under a realistic mixed CDC checkpoint (1,000
> insert + 500 update + 500 delete), local wall-clock:
> keys   EQ baseline   INDEX (COW)   % of 60s (COW)   INDEX (MOR)   % of 60s (MOR)     correct
> 5M     6 ms          6.7s          11.2%            2.2s          3.7%               PASS
> 20M    8 ms          24.2s         40.4%            6.4s          10.6%              PASS
> 50M    7 ms          51.6s         86.1%            12.2s         20.4%              PASS
> *100M* 6 ms          *75.0s*       125% (BEHIND)    *16.9s*       28.2% (keeps up)   PASS
>
> COW maintenance crosses the 60 s checkpoint around 100M (75 s/cycle,
> 125%); MOR stays at ~28% and keeps pace; the equality-delete baseline is
> ~6 ms and flat. So the structure works, but *COW alone can’t sustain
> scattered CDC at hundreds of millions of keys on a single writer*. It’s
> worth allowing a merge-on-read / update-file maintenance option alongside
> COW (or sharding the index across parallel writers).
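
The "% of 60s" columns are simply maintenance time divided by the checkpoint interval. A small sketch of that arithmetic, using the rounded seconds from the table (so results can differ from the reported percentages by about 0.1); the variable and function names are my own:

```python
CHECKPOINT_SECONDS = 60.0

# (keys, COW seconds per cycle, MOR seconds per cycle), copied from the table.
runs = [
    ("5M", 6.7, 2.2),
    ("20M", 24.2, 6.4),
    ("50M", 51.6, 12.2),
    ("100M", 75.0, 16.9),
]


def budget_share(seconds: float) -> float:
    """Maintenance time as a percentage of the checkpoint interval."""
    return 100.0 * seconds / CHECKPOINT_SECONDS


for keys, cow, mor in runs:
    # A writer falls behind once one maintenance cycle takes longer than a checkpoint.
    status = "behind" if cow > CHECKPOINT_SECONDS else "keeps up"
    print(f"{keys:>5}  COW {budget_share(cow):6.1f}% ({status})  MOR {budget_share(mor):5.1f}%")
```

Anything above 100% means a cycle does not finish before the next checkpoint arrives, which is the "BEHIND" case in the 100M row.
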
>
> *Full write-up, all tables, and the in-region reality-check:* link
> <https://docs.google.com/document/d/1G3zxbW8X0eU3UrouslZfp42bBc9CvgJGnJyDONCB4PU/edit?tab=t.0>
>
> Feedback welcome, especially on the spec direction (whether to allow a
> merge-on-read / update-file maintenance option alongside COW) and on the
> read-side modeling.
>
> Thanks,
> Huaxin
>
> On Tue, Jun 9, 2026 at 5:45 PM huaxin gao <[email protected]> wrote:
>
>> Sorry, we've skipped posting a few of the dedicated index-sync summaries
>> to the mailing list; you can find those in the Google doc
>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?pli=1&tab=t.8041k7j2n7y3>
>> and the Slack channel. Here's yesterday's summary:
>>
>> *Decided*
>>
>>    - Index vs. table (what we agreed):
>>       - Reuse table implementation/library code and a near-identical spec.
>>       The commit path will be custom regardless, so reuse isn't the
>>       deciding factor.
>>       - An index is not a table from a user/API view: loading or writing
>>       an index as a table must fail (it would violate index invariants).
>>       - The spec forbids most table behaviors: no overlapping files, one
>>       mandatory transform sort order, no column updates, no partition spec.
>>    - Delete vectors: reuse Iceberg's existing DV. Benchmarks showed no
>>    new delete format is worth introducing.
>>    - Incremental updates: start with copy-on-write only (no update files).
>>    For object-store-sized leaves, a full leaf rewrite is about as cheap as
>>    maintaining an overlay update file + DV, so we'll skip the MOR machinery
>>    for now and add it later only if benchmarks prove we need it (likely just
>>    the very-large-leaf case).
>>    - Validate the spec first: build a quick, hand-wired prototype (Parquet
>>    files structured per the spec) and benchmark it on real scales before
>>    formalizing.
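
As a rough illustration of the copy-on-write path, the sketch below treats the index as a list of leaves with non-overlapping, sorted key ranges and rewrites only the leaves touched by a batch. The in-memory dict leaves, the `lower_bounds` list, and the function names are assumptions for illustration; nothing here reflects the actual spec or file format.

```python
from bisect import bisect_right
from typing import Dict, List, Set

# Hypothetical leaf: a sorted, non-overlapping run of key -> value entries.
Leaf = Dict[int, str]


def touched_leaves(lower_bounds: List[int], keys: List[int]) -> Set[int]:
    """Return positions of the leaves whose key range covers a changed key."""
    touched = set()
    for key in keys:
        # The owning leaf is the last one whose lower bound is <= key.
        touched.add(max(bisect_right(lower_bounds, key) - 1, 0))
    return touched


def cow_apply(leaves: List[Leaf], lower_bounds: List[int],
              upserts: Dict[int, str], deletes: List[int]) -> List[Leaf]:
    """Copy-on-write: rewrite touched leaves in full, reuse the others as-is."""
    changed = touched_leaves(lower_bounds, list(upserts) + deletes)
    new_leaves = list(leaves)  # untouched leaves are shared, not copied
    for i in changed:
        lo = lower_bounds[i]
        hi = lower_bounds[i + 1] if i + 1 < len(lower_bounds) else None
        rewritten = dict(leaves[i])
        for k, v in upserts.items():
            if (i == 0 or k >= lo) and (hi is None or k < hi):
                rewritten[k] = v
        for k in deletes:
            rewritten.pop(k, None)
        # Keep the leaf sorted by key; an emptied leaf is left in place here,
        # where a real implementation would presumably drop or merge it.
        new_leaves[i] = dict(sorted(rewritten.items()))
    return new_leaves


leaves = [{1: "a", 2: "b"}, {10: "c", 11: "d"}, {20: "e"}]
lower_bounds = [1, 10, 20]
new_leaves = cow_apply(leaves, lower_bounds, {11: "D", 12: "x"}, [20])
print(new_leaves)
```

The cost profile matches the discussion: a scattered batch touches many leaves and each one is rewritten in full, while a clustered batch rewrites few.
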
>>
>> *Leaning, not final*
>>
>>    - Indexes are likely separate catalog objects, linked from the table by
>>    storing just an identifier (like materialized views) and not visible in
>>    LIST TABLES.
>>    - We'll need a commit path for indexes, but simpler than tables (no
>>    stage-create).
>>
>> *Still open*
>>
>>    - Permissions model: separate vs. inherited (action: look at what real
>>    DBs do for index permissions).
>>    - REST/catalog RPC design: minimize round-trips; index metadata
>>    ideally returned with LOAD TABLE. Catalog RPC cost may dominate
>>    Parquet IO, so this needs real design.
>>    - Scale modeling: target rows-per-leaf vs. leaf size vs. metadata-file
>>    count.
>>    - DDL-on-index semantics (reuse table schema-update actions or separate).
>>
>> Thanks,
>> Huaxin
>>
>> On Wed, Apr 22, 2026 at 8:47 AM Péter Váry <[email protected]>
>> wrote:
>>
>>> Hi All,
>>>
>>> TL;DR
>>> We still need to validate with ADLS and S3, but based on the local
>>> tests, the MPHF approach looks more promising if we can tolerate larger
>>> files and longer index maintenance times.
>>>
>>> Details:
>>> Here are the results from the local experiments on my Mac. I removed
>>> unnecessary statistics from the Parquet files and tested different row
>>> group sizes:
>>>
>>>    - For an index file with 1M records, a row group size of 5,000
>>>    appears to be the sweet spot.
>>>    - For 10M records, 10,000 rows per row group works best.
>>>
>>> If you have additional ideas for optimizing Parquet-based indexes, I’d
>>> be very interested to hear them.
>>> The test code is available on this branch:
>>> https://github.com/pvary/iceberg/tree/leaf_bench
>>>
>>> Best results:
>>> *1m records/file*
>>>
>>>    - Parquet - 5000 rows/RowGroup
>>>       - Read: 1191 µs - 1 file open, 3 seeks, 123 KB read per lookup
>>>       - Write: 1.7 s, 15 MB
>>>    - MPHF
>>>       - Read: 202 µs - 1 file open, 1 seek, 282 KB read per lookup
>>>       - Write: 0.8 s, 34 MB
>>>
>>> *10m records/file*
>>>
>>>    - Parquet - 10000 rows/RowGroup
>>>       - Read: 4168 µs - 1 file open, 3 seeks, 395 KB read per lookup
>>>       - Write: 19.5 s, 144 MB
>>>    - MPHF
>>>       - Read: 1086 µs - 1 file open, 1 seek, 2.8 MB (2812 KB) read per
>>>       lookup
>>>       - Write: 6.5 s, 353 MB
>>>
>>> Below are the full results.
>>>
>>> Benchmark                                    (indexType)    (keyType)  (numRows)  Mode    Cnt            Score           Error  Units
>>> InvertedIndexBenchmark.lookup                PARQUET_1000   LONG         1000000  ss    10000         3285.284  ±        5.138  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_1000   LONG         1000000  ss    10000   2522168989.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_1000   LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_1000   LONG         1000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_1000   LONG        10000000  ss    10000        35449.614  ±       34.673  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_1000   LONG        10000000  ss    10000  24302649201.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_1000   LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_1000   LONG        10000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_5000   LONG         1000000  ss    10000         1191.959  ±        4.169  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_5000   LONG         1000000  ss    10000   1230877229.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_5000   LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_5000   LONG         1000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_5000   LONG        10000000  ss    10000         7236.447  ±       10.374  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_5000   LONG        10000000  ss    10000   5650715973.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_5000   LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_5000   LONG        10000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_10000  LONG         1000000  ss    10000         1349.946  ±        7.834  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000  LONG         1000000  ss    10000   1730219377.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000  LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000  LONG         1000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_10000  LONG        10000000  ss    10000         4168.635  ±       11.051  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_10000  LONG        10000000  ss    10000   3946341532.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_10000  LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_10000  LONG        10000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_50000  LONG         1000000  ss    10000         4736.466  ±       38.179  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000  LONG         1000000  ss    10000   7427413541.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000  LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000  LONG         1000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                PARQUET_50000  LONG        10000000  ss    10000         4979.031  ±       34.708  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      PARQUET_50000  LONG        10000000  ss    10000   7694887636.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    PARQUET_50000  LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          PARQUET_50000  LONG        10000000  ss    10000        30000.000                  #
>>> InvertedIndexBenchmark.lookup                MPHF           LONG         1000000  ss    10000          202.571  ±        2.336  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      MPHF           LONG         1000000  ss    10000   2821570000.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    MPHF           LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          MPHF           LONG         1000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup                MPHF           LONG        10000000  ss    10000         1086.957  ±        4.524  us/op
>>> InvertedIndexBenchmark.lookup:bytesRead      MPHF           LONG        10000000  ss    10000  28119460000.000                  #
>>> InvertedIndexBenchmark.lookup:openStreams    MPHF           LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.lookup:seeks          MPHF           LONG        10000000  ss    10000        10000.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_1000   LONG         1000000  ss        3      1720731.014  ±   876636.004  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_1000   LONG         1000000  ss        3     46453317.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_1000   LONG        10000000  ss        3     18547947.876  ± 12258125.307  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_1000   LONG        10000000  ss        3    452655675.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_5000   LONG         1000000  ss        3      1718345.583  ±  1103928.016  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_5000   LONG         1000000  ss        3     44845788.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_5000   LONG        10000000  ss        3     18604229.931  ±  2668361.915  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_5000   LONG        10000000  ss        3    435388818.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_10000  LONG         1000000  ss        3      1761555.389  ±   535857.675  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000  LONG         1000000  ss        3     44536635.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_10000  LONG        10000000  ss        3     19501588.264  ±  2130054.558  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_10000  LONG        10000000  ss        3    433189623.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_50000  LONG         1000000  ss        3      1936624.889  ±  6601363.985  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000  LONG         1000000  ss        3     44264655.000                  #
>>> InvertedIndexBenchmark.write                 PARQUET_50000  LONG        10000000  ss        3     20471742.278  ± 10705206.310  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  PARQUET_50000  LONG        10000000  ss        3    431311305.000                  #
>>> InvertedIndexBenchmark.write                 MPHF           LONG         1000000  ss        3       896573.958  ±  1408024.851  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  MPHF           LONG         1000000  ss        3    102846369.000                  #
>>> InvertedIndexBenchmark.write                 MPHF           LONG        10000000  ss        3      6509348.875  ± 15519975.479  us/op
>>> InvertedIndexBenchmark.write:indexFileBytes  MPHF           LONG        10000000  ss        3   1058435733.000                  #
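
For anyone cross-checking the "Best results" summary against the raw output: the per-lookup and per-file figures appear to be the counter totals divided by the iteration count (Cnt). That reading is my assumption, inferred from openStreams being exactly 10000 for 10000 lookups. A short sketch of the arithmetic, with decimal KB/MB and helper names of my own:

```python
# Counter totals copied from the raw benchmark output above.
LOOKUP_CNT = 10_000  # lookups per configuration
WRITE_CNT = 3        # write iterations per configuration

bytes_read_total = {
    ("PARQUET_5000", 1_000_000): 1_230_877_229,
    ("MPHF", 1_000_000): 2_821_570_000,
    ("PARQUET_10000", 10_000_000): 3_946_341_532,
    ("MPHF", 10_000_000): 28_119_460_000,
}
index_file_bytes_total = {
    ("PARQUET_5000", 1_000_000): 44_845_788,
    ("MPHF", 1_000_000): 102_846_369,
    ("PARQUET_10000", 10_000_000): 433_189_623,
    ("MPHF", 10_000_000): 1_058_435_733,
}


def per_lookup_kb(total_bytes: int) -> float:
    """Average bytes read per lookup, in decimal kilobytes."""
    return total_bytes / LOOKUP_CNT / 1000


def per_file_mb(total_bytes: int) -> float:
    """Average index file size per write iteration, in decimal megabytes."""
    return total_bytes / WRITE_CNT / 1_000_000


for config, total in bytes_read_total.items():
    print(config, f"{per_lookup_kb(total):.0f} KB read per lookup")
for config, total in index_file_bytes_total.items():
    print(config, f"{per_file_mb(total):.0f} MB index file")
```

Under that assumption the numbers line up with the summary: roughly 123 KB and 282 KB per lookup at 1M rows, and about 144 MB versus 353 MB per index file at 10M rows.
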
>>>
>>> On Tue, Apr 21, 2026 at 8:53 PM huaxin gao <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> In recent secondary index sync meetings, the discussion converged on
>>>> the need to define what an index is from first principles before settling
>>>> on physical layout.
>>>>
>>>> To address that, Peter and I have drafted a requirements document for a
>>>> key lookup index (renamed from "primary key index" to avoid implying
>>>> uniqueness enforcement). The goal is to nail down one well-scoped index
>>>> type first.
>>>>
>>>> Doc: Key Lookup Index Requirements
>>>> <https://docs.google.com/document/d/1e0zxK-jA0LBDq8YQlQgFipTHelDFiga8lCkgDTmYub8/edit?tab=t.0#heading=h.8shrgabvl19>
>>>>
>>>> It covers requirements, three design options (manifest + sorted
>>>> Parquet, hash + sorted Parquet, hash + MPHF) and open questions. We will
>>>> add preliminary benchmark results shortly.
>>>>
>>>> Feedback welcome — inline in the doc, on this thread, or at the next
>>>> index sync.
>>>>
>>>> Thanks,
>>>>
>>>> Huaxin
>>>>
>>>> On Mon, Apr 13, 2026 at 7:22 AM Steven Wu <[email protected]> wrote:
>>>>
>>>>> Do we need the special index identifier that was originally proposed?
>>>>> A generic CatalogObjectIdentifier (with namespace and name) would be
>>>>> consistent with all object types in the catalog. I have a discussion
>>>>> thread on the generic identifier topic: [DISCUSS] REST Spec: generic
>>>>> CatalogObjectIdentifier.
>>>>>
>>>>> Should we add an indexes array field to table metadata? It would only
>>>>> contain a list of index object identifiers. It wouldn't contain any index
>>>>> metadata, which should live in the index objects. Yufei was trying to
>>>>> bring this up at the end of the first sync, but we didn't get enough time
>>>>> to really discuss it. It would be great to discuss this as the first
>>>>> agenda item today.
>>>>>
>>>>> On Mon, Apr 13, 2026 at 3:17 AM Péter Váry <[email protected]> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We had several engaging discussions at the Iceberg Summit, and it was
>>>>>> great to finally catch up with many of you in person. We truly missed
>>>>>> those who couldn't attend; hopefully we'll all meet again at the next
>>>>>> summit.
>>>>>>
>>>>>> To keep the conversation going, Huaxin and I have put together the
>>>>>> agenda for our next meeting. As a reminder, we’ll meet on *April
>>>>>> 13th, 9:00–10:00 AM *PDT (6:00–7:00 PM CEST).
>>>>>>
>>>>>> Proposed agenda:
>>>>>>
>>>>>>    - Continue first-principles index design discussion from Mar 30
>>>>>>       - *Index Ownership and Write Responsibility*
>>>>>>          - Should writers be allowed to update indexes, or
>>>>>>          - Should all index writes be handled exclusively by the
>>>>>>          Index Maintenance process?
>>>>>>          - If writers can update indexes, then we need to define what
>>>>>>          guarantees are required (compaction, file splitting, layout
>>>>>>          expectations).
>>>>>>          - If only Index Maintenance updates indexes, then we only
>>>>>>          need to define what observable properties should be exposed to
>>>>>>          consumers, like:
>>>>>>             - Expected max files for a single key
>>>>>>             - Current max files for a single key
>>>>>>             - Deletes allowed/present
>>>>>>             - Sorted by
>>>>>>             - Partitioned by
>>>>>>       - *Specification Scope: What Belongs in the Spec?*
>>>>>>          - Related to the ownership question above
>>>>>>          - Light spec: Just define that the index table should be
>>>>>>          optimized for retrieval by key columns and the index columns
>>>>>>          should be contained in the table. This could give us more
>>>>>>          flexibility if better organization methods come up, or
>>>>>>          - Detailed spec: We could define the max number of files
>>>>>>          per index to read for a single key, or even the partitioning
>>>>>>          and the exact sort order. This could allow more use-cases for a
>>>>>>          given index, like joins or cardinality estimations.
>>>>>>          - I would go for light spec for the main types (PK,
>>>>>>          Containing), and only the Index Maintenance processes should
>>>>>>          update the Indexes, as for many use-cases the details are not
>>>>>>          important, and writers will very rarely update the Indexes
>>>>>>          themselves.
>>>>>>       - *Logical Placement of Indexes*
>>>>>>          - Index as a child object of an Iceberg Table, or
>>>>>>          - Index as a first-class entity under
>>>>>>          /namespace/indexes/{index}
>>>>>>          - Based on the discussions at the summit we are leaning in
>>>>>>          this direction. This means the index id should be unique in the
>>>>>>          namespace, but it helps the catalog implementations quite a bit.
>>>>>>       - *Physical Placement of Index Data*
>>>>>>          - I don't think we should specify this. We should have a
>>>>>>          base location for the index, but can rely on the catalog
>>>>>>          implementations to decide on their own, like they do with the
>>>>>>          tables, views, udfs.
>>>>>>       - *Iceberg Reader Based indexes* (Containing indexes and
>>>>>>       potentially PK indexes). These are the indexes which could be read
>>>>>>       by the existing Iceberg readers. We might decide to store the PK
>>>>>>       index similarly to an Iceberg Table and treat it as a reader based
>>>>>>       index.
>>>>>>          - What are the table properties/features exposed to the
>>>>>>          readers?
>>>>>>             - Maybe just some behavioral descriptors for the
>>>>>>             optimizer to decide if the index could be used or should be
>>>>>>             skipped, like:
>>>>>>                - Expected max files for a single key
>>>>>>                - Current max files for a single key
>>>>>>                - Deletes allowed/present
>>>>>>                - Sorted by
>>>>>>                - Partitioned by
>>>>>>             - The Tasks when reading the index based on the filters
>>>>>>             and projection
>>>>>>          - What are the table properties/features exposed to the
>>>>>>          Index Maintenance? I think this could be internal to the Index
>>>>>>          Maintenance process and might not be exposed through the spec.
>>>>>>          The Index Maintenance process could handle this as a standard
>>>>>>          Iceberg Table and could be based on the Table Maintenance
>>>>>>          process, but there might be some totally different processes.
>>>>>>       - It should be possible to add properties to an index defined
>>>>>>       by the Index Maintenance process which could be used and updated
>>>>>>       in the next Index Maintenance run.
>>>>>>    - *PK index storage format benchmark results*
>>>>>>       - Flat Parquet (baseline)
>>>>>>       - BTree with Parquet leaves
>>>>>>       - Vortex
>>>>>>    - *Open items / next steps*
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>> On Mon, Mar 23, 2026 at 3:03 AM huaxin gao <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everyone, I wanted to share an update on the primary key index
>>>>>>> work.
>>>>>>> Since there are still open questions on whether bloom filter indexes
>>>>>>> fit in the secondary index framework or should be treated as extended
>>>>>>> stats, I've shifted focus to the primary key index since it's a clearer
>>>>>>> fit for the framework.
>>>>>>> I've put together a proposal for a primary key reverse-lookup index
>>>>>>> that maps each key to its physical location (file_path, row_position).
>>>>>>> It enables:
>>>>>>>
>>>>>>>    - Scan-time file pruning for point lookups
>>>>>>>    - Converting key-based deletes into position deletes
>>>>>>>    (eliminating equality deletes for Flink CDC)
>>>>>>>    - Accelerating Spark MERGE INTO by replacing full-table joins
>>>>>>>    with direct file lookups
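
A small sketch of the first bullet, scan-time file pruning for point lookups, assuming the index is available as a simple mapping from key to (file_path, row_position). The dict, the `prune_files` helper, and the file names are hypothetical and only show how the mapping narrows the planned file list:

```python
from typing import Dict, List, Set, Tuple

Location = Tuple[str, int]  # (file_path, row_position)


def prune_files(planned_files: List[str],
                index: Dict[str, Location],
                wanted_keys: List[str]) -> List[str]:
    """Keep only the planned files that the index says hold a wanted key."""
    hits: Set[str] = {index[k][0] for k in wanted_keys if k in index}
    return [f for f in planned_files if f in hits]


# Made-up index contents and scan plan.
index = {
    "k1": ("f1.parquet", 0),
    "k2": ("f2.parquet", 5),
    "k3": ("f1.parquet", 7),
}
planned = ["f1.parquet", "f2.parquet", "f3.parquet"]
pruned = prune_files(planned, index, ["k2"])
print(pruned)
```

This only holds if the index is complete and current for the snapshot being read; a stale index would need a fallback that keeps files it does not cover.
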
>>>>>>>
>>>>>>> Proposal:
>>>>>>> https://docs.google.com/document/d/1HuhCZ0n2FqDh8yqQb9oEj1CPM5yXpEsMPGZno2aSf8E/edit?tab=t.0#heading=h.tbevg4q0m9
>>>>>>> Feedback welcome!
>>>>>>> Thanks,
>>>>>>> Huaxin
>>>>>>>
>>>>>>> On Wed, Mar 18, 2026 at 11:42 PM Péter Váry <[email protected]> wrote:
>>>>>>>
>>>>>>>> Key takeaways from the general index discussion at the Mar 16
>>>>>>>> meeting.
>>>>>>>> Thanks to everyone who participated! The recording is available
>>>>>>>> here: https://www.youtube.com/watch?v=btmjhtRWUCE
>>>>>>>>
>>>>>>>>    - Q: Do we need to tie index types to the algorithms used to
>>>>>>>>    access them?
>>>>>>>>    - A: From a specification perspective, the goal is to define
>>>>>>>>    the storage-level data layout so it can be shared across engines.
>>>>>>>>    Engines are free to interpret and use the data as they see fit, but
>>>>>>>>    the on-disk data layout itself must be strictly defined and
>>>>>>>>    interoperable.
>>>>>>>>
>>>>>>>>    - Q: Should we introduce an additional abstraction layer (e.g.,
>>>>>>>>    Vector Index) with sub-types such as IVF and DiskANN?
>>>>>>>>    - A: This is possible if we decide it is beneficial. I explored
>>>>>>>>    potential naming, but it is not yet clear how such a layer would be
>>>>>>>>    used in practice.
>>>>>>>>    *Question to Yingyi Bu*: could you provide examples where this
>>>>>>>>    additional layer would be useful? Should this abstraction be
>>>>>>>>    defined at the spec level, or is it better handled at the engine
>>>>>>>>    level?
>>>>>>>>    My initial idea was that users would create a generic Vector
>>>>>>>>    Index and let the engine choose the concrete implementation.
>>>>>>>>    However, this would limit user control and users likely need to
>>>>>>>>    specify the exact index representation, which implies they must be
>>>>>>>>    aware of the available representations.
>>>>>>>>
>>>>>>>>    - Q: Do we want to allow extensibility for index types?
>>>>>>>>    - A: Yes. The intent is to support a small set of well-defined
>>>>>>>>    index types while allowing experimentation with new ones. If a new
>>>>>>>>    index type proves broadly useful, a follow-up proposal can
>>>>>>>>    standardize it and incorporate it into the spec.
>>>>>>>>
>>>>>>>>    - Q: Do we allow multiple versions of an index for the same
>>>>>>>>    table snapshot?
>>>>>>>>    - A: Yes. Older index versions must be retained for readers
>>>>>>>>    that have already started using them, while new readers should
>>>>>>>>    automatically use the latest available version.
>>>>>>>>
>>>>>>>>    - Q: Do we need to use materialized views for these indexes?
>>>>>>>>    - A: No. These indexes are primarily examples, and different
>>>>>>>>    types may require different storage methods. However, the Primary
>>>>>>>>    Key, Containing, and parts of the IVF indexes can be structured as
>>>>>>>>    Iceberg tables. This allows engines to read them natively; in some
>>>>>>>>    cases, Iceberg planners can automatically redirect queries to the
>>>>>>>>    index table without engine modifications. Furthermore, index
>>>>>>>>    maintenance for these tables can leverage existing materialized view
>>>>>>>>    maintenance workflows. Other index types may instead rely on Puffin
>>>>>>>>    files or alternative storage approaches.
>>>>>>>>
>>>>>>>>    - Q: How should index metadata be accessed? Should we add
>>>>>>>>    explicit pointers for the indexes in the table metadata?
>>>>>>>>    - A: We did not have sufficient time to fully explore and
>>>>>>>>    conclude this topic.
>>>>>>>>    *Question for Yufei Gu*: Did I understand correctly that your
>>>>>>>>    main concern stems from endpoint resolution from a REST Catalog
>>>>>>>>    perspective? Specifically, if indexes are exposed under a URI such
>>>>>>>>    as v1/{prefix}/namespaces/{namespace}/tables/{table}/indexes/{index},
>>>>>>>>    would this make it more difficult for the REST Catalog to resolve
>>>>>>>>    and route requests to the appropriate endpoint?
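
To make the routing question concrete, here are the two URI shapes side by side. The table-scoped template is the one quoted in the question; the namespace-scoped template is only my extrapolation of the "/namespace/indexes/{index}" idea raised later in this thread, and neither is a settled endpoint. Function names are hypothetical.

```python
from urllib.parse import quote


def _seg(value: str) -> str:
    """Percent-encode one path segment."""
    return quote(value, safe="")


def table_scoped_index_path(prefix: str, namespace: str, table: str, index: str) -> str:
    # Shape quoted in the question: the index is addressed through its table.
    return (f"v1/{_seg(prefix)}/namespaces/{_seg(namespace)}"
            f"/tables/{_seg(table)}/indexes/{_seg(index)}")


def namespace_scoped_index_path(prefix: str, namespace: str, index: str) -> str:
    # Assumed shape for a first-class index object: no table in the path,
    # so the index name has to be unique within the namespace.
    return f"v1/{_seg(prefix)}/namespaces/{_seg(namespace)}/indexes/{_seg(index)}"


print(table_scoped_index_path("main", "sales", "orders", "orders_pk"))
print(namespace_scoped_index_path("main", "sales", "orders_pk"))
```

With the table-scoped shape a catalog has to resolve the table before it can locate the index, while the namespace-scoped shape resolves the index directly at the cost of namespace-wide name uniqueness.
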
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 13, 2026 at 11:32 PM Suhas Jayaram Subramanya via dev
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi everyone,
>>>>>>>>>
>>>>>>>>> Here's a proposal for native Vector Index support in Iceberg
>>>>>>>>> tables:
>>>>>>>>> https://docs.google.com/document/d/1KL4qLOwdqnhOcqTc0EjO1O16NV3M3c-gZCEINDWw4lA/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> We've been working on this proposal with Peter internally at
>>>>>>>>> Microsoft and he suggested we post it here to bring this to the
>>>>>>>>> community's attention, ahead of the next Secondary Index Sync.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Suhas
>>>>>>>>>
>>>>>>>>> On 2026/02/19 04:34:34 huaxin gao wrote:
>>>>>>>>> > Hi Everyone,
>>>>>>>>> >
>>>>>>>>> > Here are the recording and notes from the Iceberg Index Support
>>>>>>>>> > Sync on 2/11.
>>>>>>>>> >
>>>>>>>>> > Recording: https://www.youtube.com/watch?v=3sFfQ0A50yk
>>>>>>>>> >
>>>>>>>>> > Notes:
>>>>>>>>> > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.8041k7j2n7y3
>>>>>>>>> >
>>>>>>>>> > The meeting will move to biweekly, Mondays 9–10am PST, starting
>>>>>>>>> > March 2.
>>>>>>>>> >
>>>>>>>>> > Since the sync, I updated the Bloom skipping index proposal
>>>>>>>>> > <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.5r5kl6k3fqwu>
>>>>>>>>> > to address the discussion questions, specifically:
>>>>>>>>> >
>>>>>>>>> > - Performance justification: when this helps (high-cardinality
>>>>>>>>> > = / IN, many data files, high object-store latency) and how it
>>>>>>>>> > differs from Parquet row-group Bloom filters (which still require
>>>>>>>>> > opening the data file).
>>>>>>>>> > - Cost / scalability: rough sizing (Bloom blob size per file,
>>>>>>>>> > Puffin file size), the planning cost trade-off (driver index reads
>>>>>>>>> > vs executor file opens), and mitigations via caching.
>>>>>>>>> > - Lifecycle / maintenance: incremental production as new data
>>>>>>>>> > files arrive, behavior when the index is missing/behind, and
>>>>>>>>> > sharding/compaction plus cleanup to avoid accumulating too many
>>>>>>>>> > small Puffin files over time.
>>>>>>>>> > - Writer expectations: inline (optional) vs asynchronous
>>>>>>>>> > (primary) index creation.
>>>>>>>>> >
>>>>>>>>> > I also implemented a Spark 4.1 POC
>>>>>>>>> > <https://github.com/apache/iceberg/pull/15311> and a local
>>>>>>>>> > benchmark to quantify both the pruning impact (plannedFiles →
>>>>>>>>> > afterBloom) and the index read overhead (statsFiles, statsBytes,
>>>>>>>>> > bloomPayloadBytes) for point predicates on high-cardinality
>>>>>>>>> > columns. Please take a look and let me know if you have any
>>>>>>>>> > questions or feedback.
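
For readers new to the idea, a toy sketch of Bloom-based file skipping for a point predicate: one small Bloom filter per data file, consulted at planning time so that files which cannot contain the value are never opened. The filter sizes, the SHA-256-based hashing, and all names are illustrative assumptions and are unrelated to the actual proposal or the Puffin blob layout.

```python
import hashlib
from typing import Dict, Iterable, List


class BloomFilter:
    """A small fixed-size Bloom filter; the sizes are illustrative only."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: str) -> Iterable[int]:
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))


def prune(planned_files: List[str], blooms: Dict[str, BloomFilter], value: str) -> List[str]:
    """Drop files whose Bloom filter rules the value out.

    A file without a filter (index missing or behind) is always kept.
    """
    return [f for f in planned_files
            if f not in blooms or blooms[f].might_contain(value)]


# Made-up data files; c.parquet has no Bloom filter yet.
contents = {"a.parquet": ["user-1", "user-42"], "b.parquet": ["user-7", "user-8"]}
blooms: Dict[str, BloomFilter] = {}
for path, values in contents.items():
    blooms[path] = BloomFilter()
    for v in values:
        blooms[path].add(v)

planned_files = ["a.parquet", "b.parquet", "c.parquet"]
after_bloom = prune(planned_files, blooms, "user-42")
print(len(planned_files), "->", len(after_bloom))
```

A Bloom filter can return false positives but never false negatives, so pruning is safe: a file that holds the value is always kept, and the only cost of a false positive is an unnecessary file open.
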
>>>>>>>>> >
>>>>>>>>> > Thanks,
>>>>>>>>> >
>>>>>>>>> > Huaxin
>>>>>>>>> >
>>>>>>>>> > On Tue, Feb 10, 2026 at 1:43 PM huaxin gao <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > > Reminder for tomorrow's sync on Iceberg Index Support.
>>>>>>>>> > >
>>>>>>>>> > > Wednesday: Feb. 11 9:00 – 10:00am
>>>>>>>>> > > Time zone: America/Los_Angeles
>>>>>>>>> > > Google Meet joining info
>>>>>>>>> > > Video call link: meet.google.com/nsp-ctyr-khk
>>>>>>>>> > > Design docs:
>>>>>>>>> > >
>>>>>>>>> > > https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2
>>>>>>>>> > >
>>>>>>>>> > > https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7
>>>>>>>>> > >
>>>>>>>>> > > Thanks,
>>>>>>>>> > > Huaxin
>>>>>>>>> > >
>>>>>>>>> > > On Tue, Feb 3, 2026 at 10:52 PM Péter Váry <[email protected]> wrote:
>>>>>>>>> > >
>>>>>>>>> > >> Thanks Huaxin and Steven for organizing this. Looking forward to
>>>>>>>>> > >> meeting you all next week!
>>>>>>>>> > >>
>>>>>>>>> > >> On Wed, Feb 4, 2026, 02:48 Steven Wu <[email protected]> wrote:
>>>>>>>>> > >>
>>>>>>>>> > >>> We set up the dev calendar event with a new Google Meet link.
>>>>>>>>> > >>> Please ignore the link from Huaxin's original email.
>>>>>>>>> > >>>
>>>>>>>>> > >>> The dev calendar has the correct info (including the new meeting
>>>>>>>>> > >>> link).
>>>>>>>>> > >>>
>>>>>>>>> > >>> Iceberg Index Support Sync
>>>>>>>>> > >>> Wednesday, February 11 · 9:00 – 10:00am
>>>>>>>>> > >>> Time zone: America/Los_Angeles
>>>>>>>>> > >>> Google Meet joining info
>>>>>>>>> > >>> Video call link: https://meet.google.com/nsp-ctyr-khk
>>>>>>>>> > >>>
>>>>>>>>> > >>> On Tue, Feb 3, 2026 at 5:08 PM huaxin gao <[email protected]> wrote:
>>>>>>>>> > >>>
>>>>>>>>> > >>>> Sorry, I meant PST (not EST) :)
>>>>>>>>> > >>>> Looking forward to the discussion!
>>>>>>>>> > >>>>
>>>>>>>>> > >>>> On Tue, Feb 3, 2026 at 4:58 PM Shawn Chang <[email protected]> wrote:
>>>>>>>>> > >>>>
>>>>>>>>> > >>>>> Hi Huaxin,
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> Thanks for starting the sync!
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> The meeting seems to be 9-10AM PST on the dev events calendar
>>>>>>>>> > >>>>> <https://calendar.google.com/calendar/u/0?cid=MzkwNWQ0OTJmMWI0NTBiYTA3MTJmMmFlNmFmYTc2ZWI3NTdmMTNkODUyMjBjYzAzYWE0NTI3ODg1YWRjNTYyOUBncm91cC5jYWxlbmRhci5nb29nbGUuY29t>,
>>>>>>>>> > >>>>> not EST. Maybe it's a typo?
>>>>>>>>> > >>>>> Otherwise, looking forward to the discussion!
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> Best,
>>>>>>>>> > >>>>> Shawn
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>> On Tue, Feb 3, 2026 at 9:18 AM huaxin gao <[email protected]> wrote:
>>>>>>>>> > >>>>>
>>>>>>>>> > >>>>>> Hi all,
>>>>>>>>> > >>>>>> I'd like to start a dedicated sync to discuss Iceberg Index
>>>>>>>>> > >>>>>> support. Here is the existing discussion thread:
>>>>>>>>> > >>>>>> https://lists.apache.org/thread/fzqk3jjf0xpj5m4cfqb3v4c65p0t04ty.
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> To ground the discussion, here are the two proposals:
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> - Peter's proposal
>>>>>>>>> > >>>>>> <https://docs.google.com/document/d/1N6a2IOzC6Qsqv7NBqHKesees4N6WF49YUSIX2FrF7S0/edit?tab=t.0#heading=h.hs6r9d26w1y2>
>>>>>>>>> > >>>>>> (overall index support)
>>>>>>>>> > >>>>>> - My proposal
>>>>>>>>> > >>>>>> <https://docs.google.com/document/d/1x-0KT43aTrt8u6EV7EgSietIFQSkGsocqwnBTHPebRU/edit?tab=t.0#heading=h.qouk73o4jxx7>
>>>>>>>>> > >>>>>> (bloom filter skipping index)
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> Time slot: Every 3 weeks, Wednesdays at 9 AM to 10 AM EST,
>>>>>>>>> > >>>>>> starting next Wednesday (2/11). After FileFormat sync finishes,
>>>>>>>>> > >>>>>> we plan to use that slot and switch to every other Monday, 9 AM
>>>>>>>>> > >>>>>> to 10 AM EST.
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> Meet link: https://meet.google.com/fjn-tyze-mko
>>>>>>>>> > >>>>>>
>>>>>>>>> > >>>>>> Thanks,
>>>>>>>>> > >>>>>> Huaxin