Hi Jingsong,

Thanks for the detailed explanation! This makes a lot of sense to me.

Looking forward to this feature! I'd be glad to contribute to it.

Best,
Xinyu Wang


On Thu, Jun 4, 2026 at 4:11 PM Jingsong Li <[email protected]> wrote:

> Hi Wang,
>
> Thanks for the excellent questions! These are exactly the right things
> to think about early. Let me address each one.
>
> The short answer is: the first version focuses on a per-data-file
> local index, but the design intentionally separates the "training
> artifacts" (centroids, codebooks, OPQ rotation) from the "index data"
> (inverted lists), which is the key prerequisite for distributed
> indexing.
>
> ## Q1: Will IVF-PQ use globally trained centroids/codebooks?
>
> Yes, this is the intended direction. The architecture naturally
> supports a two-phase approach:
>
> **Phase 1 (current):** Training and indexing happen together per data
> file. Each file trains its own centroids and PQ codebooks. This is
> simple and already works.
>
> **Phase 2 (future):** Separate training from indexing:
>
> ```
> Global Training (once per table/partition):
>   Sample vectors from multiple files → train centroids + PQ codebooks + OPQ rotation
>   → Store as a shared "training artifact" file
>
> Distributed Build (per worker):
>   Load shared training artifacts
>   → Assign vectors to global IVF lists
>   → Encode residuals with shared PQ codebooks
>   → Write index file (only contains inverted list data, references shared artifacts)
> ```
>
> The file format already supports this separation — centroids and
> codebooks are stored in a well-defined section of the file. In Phase
> 2, we can either:
> - Embed the shared artifacts in each index file (simple, some redundancy)
> - Or reference an external artifact file via metadata (no redundancy,
> slightly more complex)
>
> The `IVFPQIndexMeta` (stored in `GlobalIndexIOMeta.metadata()`) can be
> extended with a `training_artifact_path` field without breaking
> backward compatibility.
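
To make the Phase 2 "Distributed Build" steps concrete, here is a minimal Java sketch of what one worker might do per vector once the shared artifacts are loaded: assign the vector to its nearest global centroid, take the residual, and encode each sub-vector with the shared PQ codebooks. This is not code from the prototype. The class `SharedArtifactEncodeSketch`, its method names, and the array-based representation of centroids and codebooks are assumptions for illustration, and OPQ rotation is left out.

```java
// Illustrative only: names and data layout are hypothetical, not Paimon APIs.
final class SharedArtifactEncodeSketch {
    private SharedArtifactEncodeSketch() {}

    // Index of the row in table closest (squared L2) to the slice of v that
    // starts at offset and has the same length as the table rows.
    static int nearest(float[] v, int offset, float[][] table) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < table.length; i++) {
            double dist = 0;
            for (int d = 0; d < table[i].length; d++) {
                double diff = v[offset + d] - table[i][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // Returns {listId, code_0, ..., code_(m-1)} for one vector.
    // centroids: shared coarse centroids; codebooks[s]: codewords for sub-vector s.
    static int[] encode(float[] vector, float[][] centroids, float[][][] codebooks) {
        // Assign the vector to a global IVF list.
        int listId = nearest(vector, 0, centroids);

        // Residual relative to the assigned centroid.
        float[] residual = new float[vector.length];
        for (int d = 0; d < vector.length; d++) {
            residual[d] = vector[d] - centroids[listId][d];
        }

        // Encode each residual sub-vector with its shared codebook.
        int[] out = new int[1 + codebooks.length];
        out[0] = listId;
        int offset = 0;
        for (int s = 0; s < codebooks.length; s++) {
            out[1 + s] = nearest(residual, offset, codebooks[s]);
            offset += codebooks[s][0].length;
        }
        return out;
    }
}
```

Because every worker uses the same centroids and codebooks, the list IDs and PQ codes they emit are directly comparable across index files, which is what the later routing and compaction steps rely on.
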
>
> ## Q2: How will distributed index building work?
>
> Paimon's existing architecture already provides the distributed
> skeleton. The key insight is that IVF-PQ's inverted lists are
> **partitionable by list ID**:
>
> ```
> Global IVF lists: [0, 1, 2, ..., nlist-1]
>
> Worker 0 writes: index-file-0.ivfpq → contains lists {0, 3, 7, 12, ...}
> Worker 1 writes: index-file-1.ivfpq → contains lists {1, 4, 8, 15, ...}
> Worker 2 writes: index-file-2.ivfpq → contains lists {2, 5, 9, 11, ...}
> ```
>
> Each worker writes an independent index file. **No physical merge is
> needed** — Paimon's snapshot metadata already tracks multiple index
> files per partition/bucket. The query layer reads the relevant files
> based on metadata.
>
> This is analogous to how the BTree global index already works: BTree
> stores `firstKey`/`lastKey` in `BTreeIndexMeta` to enable file pruning
> via `BTreeFileMetaSelector`. For IVF-PQ, we can store the **list ID
> set** (which IVF lists this file contains) in `IVFPQIndexMeta`:
>
> ```
> BTree metadata:    {firstKey, lastKey}     → prune files by key range
> IVF-PQ metadata:   {listIds: [3,7,12,...]} → prune files by IVF list assignment
> ```
>
> The `GlobalIndexIOMeta` system supports this naturally — each
> `ResultEntry` already carries a `byte[] meta` field. We just need to
> serialize the list ID set there.
>
> Logical merging through metadata is sufficient for reads. Physical
> compaction (merging multiple index files into one) can be a background
> optimization, similar to how Paimon compacts data files.
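
As a rough illustration of serializing the list ID set into a `byte[] meta` field, the sketch below encodes a sorted array of IVF list IDs and checks whether a file overlaps the probed lists. The class `ListIdSetMeta` and the byte layout (a 4-byte count followed by 4-byte big-endian IDs) are assumptions made for this example, not the actual `IVFPQIndexMeta` format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper, not part of Paimon: encodes the IVF list IDs held by
// one index file so they can travel in a byte[] metadata field.
final class ListIdSetMeta {
    private ListIdSetMeta() {}

    // Assumed layout: 4-byte count, then that many 4-byte big-endian list IDs,
    // stored in ascending order so lookups can use binary search.
    static byte[] serialize(int[] listIds) {
        int[] sorted = listIds.clone();
        Arrays.sort(sorted);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * sorted.length);
        buf.putInt(sorted.length);
        for (int id : sorted) {
            buf.putInt(id);
        }
        return buf.array();
    }

    static int[] deserialize(byte[] meta) {
        ByteBuffer buf = ByteBuffer.wrap(meta);
        int[] ids = new int[buf.getInt()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = buf.getInt();
        }
        return ids;
    }

    // True if the file described by meta holds at least one probed list.
    static boolean overlaps(byte[] meta, int[] probedListIds) {
        int[] ids = deserialize(meta);
        for (int probed : probedListIds) {
            if (Arrays.binarySearch(ids, probed) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```

For large `nlist` values a bitmap would likely be more compact than a plain ID array; the array form is used here only because it is the simplest thing to read.
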
>
> ## Q3: How will query routing work?
>
> With the metadata described above, the query flow for distributed IVF-PQ
> is:
>
> ```
> 1. Compute top-nprobe IVF list IDs from query vector (using shared centroids)
> 2. For each list ID, look up which index files contain that list (via metadata)
> 3. Read only those files, only the relevant inverted lists within each file
> 4. Merge results across files using a global top-k heap
> ```
>
> Step 2 is the "file selector" step, directly analogous to
> `BTreeFileMetaSelector`:
>
> ```
> // BTree: select files whose key range overlaps the query
> BTreeFileMetaSelector → visitEqual(key) → filter by firstKey/lastKey
>
> // IVF-PQ: select files whose list IDs overlap the probed lists
> IVFPQFileMetaSelector → visitVectorSearch(query) → filter by listIds
> ```
>
> This means the query only reads the minimum number of files and the
> minimum number of inverted lists within each file — two levels of
> pruning.
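
The routing steps above can be sketched in Java as follows. This covers step 1 (pick the `nprobe` closest centroids), step 2 (select files whose list IDs overlap the probed lists), and step 4 (merge per-file candidates with a bounded heap). Step 3, the actual reads, is omitted. Everything here, including `IvfPqRoutingSketch`, `Candidate`, and the in-memory map from file name to list IDs, is a hypothetical stand-in and not the `IVFPQFileMetaSelector` API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: class and method names are hypothetical, not Paimon APIs.
final class IvfPqRoutingSketch {
    private IvfPqRoutingSketch() {}

    static final class Candidate {
        final long rowId;
        final float distance;

        Candidate(long rowId, float distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    // Step 1: IDs of the nprobe centroids closest to the query (squared L2).
    static int[] topNprobe(float[] query, float[][] centroids, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        double[] dist = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            order[c] = c;
            for (int d = 0; d < query.length; d++) {
                double diff = query[d] - centroids[c][d];
                dist[c] += diff * diff;
            }
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> dist[i]));
        int[] probed = new int[Math.min(nprobe, order.length)];
        for (int i = 0; i < probed.length; i++) {
            probed[i] = order[i];
        }
        return probed;
    }

    // Step 2: keep only files whose list-ID set intersects the probed lists.
    static Set<String> selectFiles(Map<String, Set<Integer>> listIdsByFile, int[] probed) {
        Set<String> selected = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : listIdsByFile.entrySet()) {
            for (int id : probed) {
                if (e.getValue().contains(id)) {
                    selected.add(e.getKey());
                    break;
                }
            }
        }
        return selected;
    }

    // Step 4: merge per-file candidates; returns row IDs of the k nearest,
    // ordered by ascending distance.
    static long[] mergeTopK(List<List<Candidate>> perFile, int k) {
        // Max-heap on distance: the root is the worst candidate kept so far.
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
        for (List<Candidate> candidates : perFile) {
            for (Candidate c : candidates) {
                heap.add(c);
                if (heap.size() > k) {
                    heap.poll();
                }
            }
        }
        long[] rowIds = new long[heap.size()];
        for (int i = rowIds.length - 1; i >= 0; i--) {
            rowIds[i] = heap.poll().rowId;
        }
        return rowIds;
    }
}
```

The merge is only meaningful because all files share the same codebooks, so approximate distances computed against different files are on the same scale.
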
>
> ## Q4: Incremental index build and compaction?
>
> **Incremental build:** When new data files are added, a new index file
> is built for the new data using the shared training artifacts (same
> centroids, same PQ codebooks). No need to rebuild existing index
> files. The new file is registered in the snapshot metadata alongside
> existing index files.
>
> **Index compaction:** Multiple small index files can be merged into
> larger ones as a background optimization. Since inverted lists are
> independent, this is a simple concatenation per list ID — no
> retraining needed.
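
A minimal sketch of "concatenation per list ID", assuming each index file can be modelled in memory as a map from IVF list ID to its encoded entries. The class `IvfPqCompactionSketch` and this map-based model are hypothetical; a real compaction would stream inverted lists from storage and rewrite the offset table. The sketch is only valid when all inputs were built with the same centroids and codebooks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model: one index file = map from IVF list ID to the
// encoded entries (for example row ID plus PQ code) assigned to that list.
final class IvfPqCompactionSketch {
    private IvfPqCompactionSketch() {}

    // Merges files that share the same training artifacts. Entries are
    // appended unchanged, so no re-encoding or retraining takes place.
    static Map<Integer, List<byte[]>> compact(List<Map<Integer, List<byte[]>>> files) {
        Map<Integer, List<byte[]>> merged = new TreeMap<>();
        for (Map<Integer, List<byte[]>> file : files) {
            for (Map.Entry<Integer, List<byte[]>> list : file.entrySet()) {
                merged.computeIfAbsent(list.getKey(), id -> new ArrayList<>())
                        .addAll(list.getValue());
            }
        }
        return merged;
    }
}
```
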
>
> **Re-training:** When data distribution drifts significantly, the
> shared training artifacts can be retrained from fresh samples and all
> index files rebuilt. This is a heavy operation but infrequent (similar
> to how partition statistics are refreshed).
>
> ## Summary
>
> The critical point: the file format and metadata abstraction in V1
> should accommodate the distributed case. The centroids/codebooks
> section is clearly separated from inverted list data, and
> `IVFPQIndexMeta` is extensible. We don't need to redesign the format
> when adding distributed support — we just need to add the
> orchestration layer (global training coordinator, file selector,
> multi-file merger).
>
> Best,
> Jingsong Li
>
> On Thu, Jun 4, 2026 at 4:00 PM wang <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > I'm really excited about this feature! A native IVF-PQ implementation
> > sounds very valuable for Paimon, especially if we can make it lake-native
> > and avoid external native dependencies.
> >
> > One question I have is whether the current proposal is mainly focused on
> > introducing a local vector index library, or whether it is also intended
> > to lay the groundwork for a more distributed vector index framework in
> > Paimon, similar to Lance's approach.
> >
> > If we want to support large-scale distributed vector indexes in the
> > future, I think there are several design questions that may affect the
> > index file format, metadata, and build/query workflow:
> >
> >    1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids,
> >    and PQ codebooks? For example, in Lance, global IVF centroids are
> >    trained from global samples before workers build the actual index
> >    files, and PQ codebooks are also generated before distributed index
> >    writing.
> >    2. How will distributed index building work? Will each worker write
> >    independent index files or index segments, and then commit them
> >    through a global index manifest/catalog? Do we need to merge these
> >    files physically, or is logical merging through metadata enough?
> >    3. How will query routing work in a distributed setup? For example,
> >    after finding the top nprobe global IVF lists for a query, how do we
> >    map these list IDs to physical index files or workers?
> >    4. Do we plan to support incremental index build and index compaction?
> >
> > I don't think all of these need to be solved in the first version, but it
> > would be great if the proposal could clarify the long-term direction. My
> > main concern is that these distributed/snapshot-aware requirements may
> > influence the index library abstraction and file format from the
> > beginning.
> >
> > On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li <[email protected]> wrote:
> >
> > > Hi devs,
> > >
> > > I'd like to propose adding a native IVF-PQ (Inverted File with Product
> > > Quantization) vector index to Paimon. I have a working prototype and
> > > would like to get community feedback on the direction before opening a
> > > formal proposal.
> > >
> > > ## Motivation
> > >
> > > Paimon currently supports vector search via Lumina (DiskANN), which
> > > depends on an external native library (`lumina-jni` from Alibaba
> > > Cloud). While DiskANN is a great algorithm, the external dependency
> > > brings several challenges:
> > >
> > > 1. **Availability**: The library is hosted on a private Maven
> > > repository, not on Maven Central. This creates a supply chain risk for
> > > the open-source project.
> > > 2. **Single algorithm**: DiskANN is graph-based and has its strengths,
> > > but IVF-PQ is the industry standard for large-scale ANN with good
> > > accuracy-speed-memory trade-offs, especially when combined with OPQ.
> > >
> > > ## Why not just use Faiss?
> > >
> > > Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> > > architecture:
> > > 1. **No InputStream abstraction**: Faiss's `read_index` /
> > > `write_index` operates on local files or memory buffers. There is no
> > > way to plug in a custom I/O layer. In Paimon, index files live on
> > > object storage (S3/HDFS/OSS) and are accessed via
> > > `SeekableInputStream`. To use Faiss, we would have to download the
> > > entire index file to local disk first — defeating the purpose of a
> > > data lake architecture.
> > > 2. **JNI callback is impractical**: Even if we wrapped Faiss via JNI,
> > > Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> > > Redirecting these to Java's `SeekableInputStream` would require
> > > intercepting every low-level C read call, which is fragile and has
> > > high per-call overhead. Our Rust implementation uses a `SeekRead`
> > > trait that maps cleanly to JNI callbacks at the inverted-list
> > > granularity (a few bulk reads per query, not thousands of small
> > > reads).
> > > 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a
> > > large C++ framework (~200K lines) with GPU support, multi-index
> > > quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ
> > > search path. A focused Rust implementation covers this in ~3K lines,
> > > is easier to maintain, and has zero system library requirements (no
> > > BLAS, LAPACK, or OpenMP).
> > > 4. **Cross-language consistency**: Faiss has separate C++ and Python
> > > interfaces but no Java interface. With Rust, one codebase produces
> > > both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings,
> > > guaranteeing identical behavior and file format compatibility.
> > >
> > > Our implementation aligns with Faiss's algorithmic details (ADC,
> > > precomputed tables, OPQ, k-means++ empty cluster handling) so the
> > > search quality is equivalent — we just own the I/O layer.
> > >
> > > ## Proposal
> > >
> > > Build a **pure Rust IVF-PQ** implementation inside the Paimon
> > > repository, with JNI bindings for Java and PyO3 bindings for Python.
> > > Key design goals:
> > >
> > > - **SeekableInputStream native**: The file format is designed for
> > > remote storage — offset table precedes inverted list data, so queries
> > > only read `nprobe` lists via seek+read, not the entire file.
> > > - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> > > tables, residual encoding, k-means++ with Faiss-style empty cluster
> > > handling, OPQ rotation via Procrustes+SVD.
> > > - **Zero external dependency**: The core library uses only pure Rust
> > > crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> > > LAPACK, or OpenMP required.
> > > - **One codebase, two languages**: A single Rust core serves both the
> > > Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent
> > > behavior.
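
To illustrate the "offset table precedes inverted list data" goal, here is a small Java sketch in which a `ByteBuffer` stands in for a seekable remote file. The layout used (an `int` list count, then `nlist + 1` `int` offsets relative to the data region, then the concatenated list bytes) is an assumption for this example and not the proposed on-disk format; `OffsetTableSketch` is likewise a hypothetical name.

```java
import java.nio.ByteBuffer;

// Illustrative only: the layout below is assumed, not the proposed file format.
final class OffsetTableSketch {
    private OffsetTableSketch() {}

    // Writes: int nlist | (nlist + 1) int offsets | concatenated list bytes.
    static byte[] write(byte[][] lists) {
        int headerSize = 4 + 4 * (lists.length + 1);
        int dataSize = 0;
        for (byte[] list : lists) {
            dataSize += list.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
        buf.putInt(lists.length);
        int offset = 0;
        for (byte[] list : lists) {
            buf.putInt(offset);
            offset += list.length;
        }
        buf.putInt(offset); // end marker, so list i spans [offset[i], offset[i + 1])
        for (byte[] list : lists) {
            buf.put(list);
        }
        return buf.array();
    }

    // Reads one inverted list by touching only the header and that list's bytes,
    // which corresponds to a seek followed by one bulk read on remote storage.
    static byte[] readList(ByteBuffer file, int listId) {
        int nlist = file.getInt(0);
        int dataStart = 4 + 4 * (nlist + 1);
        int start = file.getInt(4 + 4 * listId);
        int end = file.getInt(4 + 4 * (listId + 1));
        byte[] out = new byte[end - start];
        ByteBuffer view = file.duplicate();
        view.position(dataStart + start);
        view.get(out);
        return out;
    }
}
```

With a layout of this kind, a query that probes `nprobe` lists needs one read for the offset table plus one bulk read per probed list, independent of the total file size.
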
> > >
> > > We need to create a new `paimon-vector-index` sub-project repository.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Jingsong
> > >
>
