Hi Jingsong,

Strong +1 from my side. This would be valuable and help remove the
external lumina-jni dependency.
Happy to support with dev efforts and review.

Warm Regards,
Arnav

On Thu, Jun 4, 2026 at 1:54 PM wang <[email protected]> wrote:

> Hi Jingsong,
>
> Thanks for the detailed explanation! This makes a lot of sense to me.
>
> Looking forward to this feature! I'd be glad to contribute to it.
>
> Best,
> Xinyu Wang
>
>
> On Thu, Jun 4, 2026 at 4:11 PM Jingsong Li <[email protected]> wrote:
>
> > Hi Wang,
> >
> > Thanks for the excellent questions! These are exactly the right things
> > to think about early. Let me address each one.
> >
> > The short answer is: the first version focuses on a per-data-file
> > local index, but the design intentionally separates the "training
> > artifacts" (centroids, codebooks, OPQ rotation) from the "index data"
> > (inverted lists), which is the key prerequisite for distributed
> > indexing.
> >
> > ## Q1: Will IVF-PQ use globally trained centroids/codebooks?
> >
> > Yes, this is the intended direction. The architecture naturally
> > supports a two-phase approach:
> >
> > **Phase 1 (current):** Training and indexing happen together per data
> > file. Each file trains its own centroids and PQ codebooks. This is
> > simple and already works.
> >
> > **Phase 2 (future):** Separate training from indexing:
> >
> > ```
> > Global Training (once per table/partition):
> >   Sample vectors from multiple files → train centroids + PQ codebooks + OPQ rotation
> >   → Store as a shared "training artifact" file
> >
> > Distributed Build (per worker):
> >   Load shared training artifacts
> >   → Assign vectors to global IVF lists
> >   → Encode residuals with shared PQ codebooks
> >   → Write index file (only contains inverted list data, references shared artifacts)
> > ```
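
A minimal sketch of the per-worker "assign, then encode the residual" step from the Phase 2 outline above, assuming squared-L2 distance and plain float arrays. The class and method names are hypothetical and are not taken from the prototype; PQ encoding of the residual is left out.

```java
import java.util.Arrays;

// Hypothetical illustration only: names and layout are assumptions.
final class IvfAssignSketch {

    /** Returns the index of the centroid closest to v under squared L2 distance. */
    static int nearestCentroid(float[][] centroids, float[] v) {
        int best = -1;
        float bestDist = Float.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            float d = 0f;
            for (int j = 0; j < v.length; j++) {
                float diff = v[j] - centroids[i][j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    /** Residual = vector minus its assigned centroid; this is the input to PQ encoding. */
    static float[] residual(float[] centroid, float[] v) {
        float[] r = new float[v.length];
        for (int j = 0; j < v.length; j++) {
            r[j] = v[j] - centroid[j];
        }
        return r;
    }

    public static void main(String[] args) {
        float[][] sharedCentroids = {{0f, 0f}, {10f, 10f}};
        float[] v = {9f, 11f};
        int listId = nearestCentroid(sharedCentroids, v);
        System.out.println(listId + " " + Arrays.toString(residual(sharedCentroids[listId], v)));
    }
}
```

Because every worker loads the same centroids, the same vector gets the same list ID no matter which worker processes it, which is what makes the per-worker files comparable at query time.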
> >
> > The file format already supports this separation — centroids and
> > codebooks are stored in a well-defined section of the file. In Phase
> > 2, we can either:
> > - Embed the shared artifacts in each index file (simple, some redundancy)
> > - Or reference an external artifact file via metadata (no redundancy,
> > slightly more complex)
> >
> > The `IVFPQIndexMeta` (stored in `GlobalIndexIOMeta.metadata()`) can be
> > extended with a `training_artifact_path` field without breaking
> > backward compatibility.
> >
> > ## Q2: How will distributed index building work?
> >
> > Paimon's existing architecture already provides the distributed
> > skeleton. The key insight is that IVF-PQ's inverted lists are
> > **partitionable by list ID**:
> >
> > ```
> > Global IVF lists: [0, 1, 2, ..., nlist-1]
> >
> > Worker 0 writes: index-file-0.ivfpq → contains lists {0, 3, 7, 12, ...}
> > Worker 1 writes: index-file-1.ivfpq → contains lists {1, 4, 8, 15, ...}
> > Worker 2 writes: index-file-2.ivfpq → contains lists {2, 5, 9, 11, ...}
> > ```
> >
> > Each worker writes an independent index file. **No physical merge is
> > needed** — Paimon's snapshot metadata already tracks multiple index
> > files per partition/bucket. The query layer reads the relevant files
> > based on metadata.
> >
> > This is analogous to how BTree global index already works: BTree
> > stores `firstKey`/`lastKey` in `BTreeIndexMeta` to enable file pruning
> > via `BTreeFileMetaSelector`. For IVF-PQ, we can store the **list ID
> > set** (which IVF lists this file contains) in `IVFPQIndexMeta`:
> >
> > ```
> > BTree metadata:    {firstKey, lastKey}     → prune files by key range
> > IVF-PQ metadata:   {listIds: [3,7,12,...]} → prune files by IVF list assignment
> > ```
> >
> > The `GlobalIndexIOMeta` system supports this naturally — each
> > `ResultEntry` already carries a `byte[] meta` field. We just need to
> > serialize the list ID set there.
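
One possible way to pack a list ID set into a `byte[]` meta field, as a rough sketch. The encoding here (a count followed by big-endian 32-bit IDs) and the class name are assumptions for illustration; the actual `IVFPQIndexMeta` layout is not specified in this thread.

```java
import java.nio.ByteBuffer;

// Hypothetical encoding: [count:int][id:int]*, big-endian (ByteBuffer default).
final class ListIdMetaSketch {

    static byte[] serialize(int[] listIds) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * listIds.length);
        buf.putInt(listIds.length);
        for (int id : listIds) {
            buf.putInt(id);
        }
        return buf.array();
    }

    static int[] deserialize(byte[] meta) {
        ByteBuffer buf = ByteBuffer.wrap(meta);
        int n = buf.getInt();
        int[] ids = new int[n];
        for (int i = 0; i < n; i++) {
            ids[i] = buf.getInt();
        }
        return ids;
    }

    public static void main(String[] args) {
        byte[] meta = serialize(new int[] {3, 7, 12});
        System.out.println(meta.length + " bytes, " + deserialize(meta).length + " ids");
    }
}
```

For large `nlist` a bitmap or a delta-encoded list would likely be more compact; the fixed-width form is only the simplest thing that round-trips.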
> >
> > Logical merging through metadata is sufficient for reads. Physical
> > compaction (merging multiple index files into one) can be a background
> > optimization, similar to how Paimon compacts data files.
> >
> > ## Q3: How will query routing work?
> >
> > With the metadata described above, the query flow for distributed IVF-PQ
> > is:
> >
> > ```
> > 1. Compute top-nprobe IVF list IDs from query vector (using shared centroids)
> > 2. For each list ID, look up which index files contain that list (via metadata)
> > 3. Read only those files, only the relevant inverted lists within each file
> > 4. Merge results across files using a global top-k heap
> > ```
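
Steps 1 and 4 of the flow above can be sketched as follows, assuming squared-L2 distance and a bounded max-heap for the merge. All names (`NprobeAndMergeSketch`, `Hit`, `topNprobe`, `mergeTopK`) are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: not the prototype's API.
final class NprobeAndMergeSketch {

    static final class Hit {
        final long rowId;
        final float dist;

        Hit(long rowId, float dist) {
            this.rowId = rowId;
            this.dist = dist;
        }
    }

    /** Step 1: IDs of the nprobe centroids closest to the query, nearest first. */
    static int[] topNprobe(float[][] centroids, float[] query, int nprobe) {
        float[] dist = new float[centroids.length];
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < centroids.length; i++) {
            order[i] = i;
            for (int j = 0; j < query.length; j++) {
                float diff = query[j] - centroids[i][j];
                dist[i] += diff * diff;
            }
        }
        Arrays.sort(order, (a, b) -> Float.compare(dist[a], dist[b]));
        int n = Math.min(nprobe, order.length);
        int[] out = new int[n];
        for (int k = 0; k < n; k++) {
            out[k] = order[k];
        }
        return out;
    }

    /** Step 4: keep the k smallest distances across all per-file result lists. */
    static List<Hit> mergeTopK(List<List<Hit>> perFile, int k) {
        List<Hit> out = new ArrayList<>();
        if (k <= 0) {
            return out;
        }
        // Max-heap on distance: the current worst of the best k sits on top.
        PriorityQueue<Hit> heap = new PriorityQueue<>((a, b) -> Float.compare(b.dist, a.dist));
        for (List<Hit> hits : perFile) {
            for (Hit h : hits) {
                if (heap.size() < k) {
                    heap.add(h);
                } else if (h.dist < heap.peek().dist) {
                    heap.poll();
                    heap.add(h);
                }
            }
        }
        out.addAll(heap);
        out.sort((a, b) -> Float.compare(a.dist, b.dist));
        return out;
    }

    public static void main(String[] args) {
        float[][] centroids = {{0f, 0f}, {5f, 5f}, {1f, 1f}};
        System.out.println(Arrays.toString(topNprobe(centroids, new float[] {0.9f, 0.9f}, 2)));
    }
}
```

The merge never holds more than k hits in memory, so the cost of combining results does not grow with the number of index files beyond one pass over their candidates.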
> >
> > Step 2 is the "file selector" step, directly analogous to
> > `BTreeFileMetaSelector`:
> >
> > ```java
> > // BTree: select files whose key range overlaps the query
> > BTreeFileMetaSelector → visitEqual(key) → filter by firstKey/lastKey
> >
> > // IVF-PQ: select files whose list IDs overlap the probed lists
> > IVFPQFileMetaSelector → visitVectorSearch(query) → filter by listIds
> > ```
> >
> > This means the query only reads the minimum number of files and the
> > minimum number of inverted lists within each file — two levels of
> > pruning.
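
The file-level half of that pruning might look roughly like this, assuming each file's list ID set has already been decoded from its metadata into a `Set<Integer>`. `ListIdFileSelectorSketch` is a made-up name; it is not the proposed `IVFPQFileMetaSelector` and does not use any Paimon API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: plain collections stand in for index metadata.
final class ListIdFileSelectorSketch {

    /** Keeps only the files whose list ID set intersects the probed list IDs. */
    static List<String> selectFiles(Map<String, Set<Integer>> fileToListIds, Set<Integer> probed) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : fileToListIds.entrySet()) {
            if (!Collections.disjoint(e.getValue(), probed)) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> files = new LinkedHashMap<>();
        files.put("index-file-0.ivfpq", new HashSet<>(Arrays.asList(0, 3, 7, 12)));
        files.put("index-file-1.ivfpq", new HashSet<>(Arrays.asList(1, 4, 8, 15)));
        files.put("index-file-2.ivfpq", new HashSet<>(Arrays.asList(2, 5, 9, 11)));
        System.out.println(selectFiles(files, new HashSet<>(Arrays.asList(3, 8))));
    }
}
```

With the worker layout from Q2 and probed lists {3, 8}, only the first two files overlap, so the third is never opened.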
> >
> > ## Q4: Incremental index build and compaction?
> >
> > **Incremental build:** When new data files are added, a new index file
> > is built for the new data using the shared training artifacts (same
> > centroids, same PQ codebooks). No need to rebuild existing index
> > files. The new file is registered in the snapshot metadata alongside
> > existing index files.
> >
> > **Index compaction:** Multiple small index files can be merged into
> > larger ones as a background optimization. Since inverted lists are
> > independent, this is a simple concatenation per list ID — no
> > retraining needed.
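
A rough sketch of "concatenation per list ID", assuming each inverted list is an opaque byte run of self-contained entries (for example row ID plus PQ code) so that appending two runs yields a valid list. The in-memory `Map<Integer, byte[]>` representation and the class name are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: a real compaction would stream from files.
final class ListCompactionSketch {

    /** Merges several files' inverted lists by appending the bytes of lists that share an ID. */
    static Map<Integer, byte[]> compact(List<Map<Integer, byte[]>> files) {
        Map<Integer, ByteArrayOutputStream> merged = new TreeMap<>();
        for (Map<Integer, byte[]> file : files) {
            for (Map.Entry<Integer, byte[]> e : file.entrySet()) {
                byte[] entries = e.getValue();
                merged.computeIfAbsent(e.getKey(), id -> new ByteArrayOutputStream())
                        .write(entries, 0, entries.length);
            }
        }
        Map<Integer, byte[]> out = new TreeMap<>();
        for (Map.Entry<Integer, ByteArrayOutputStream> e : merged.entrySet()) {
            out.put(e.getKey(), e.getValue().toByteArray());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, byte[]> older = new HashMap<>();
        older.put(3, new byte[] {1, 2});
        Map<Integer, byte[]> newer = new HashMap<>();
        newer.put(3, new byte[] {3});
        System.out.println(compact(java.util.Arrays.asList(older, newer)).get(3).length);
    }
}
```

This only holds while all inputs were encoded with the same shared centroids and codebooks; files built from different training artifacts cannot be merged this way.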
> >
> > **Re-training:** When data distribution drifts significantly, the
> > shared training artifacts can be retrained from fresh samples and all
> > index files rebuilt. This is a heavy operation but infrequent (similar
> > to how partition statistics are refreshed).
> >
> > ## Summary
> >
> > The critical point: the file format and metadata abstraction in V1
> > should accommodate the distributed case. The centroids/codebooks
> > section is clearly separated from inverted list data, and
> > `IVFPQIndexMeta` is extensible. We don't need to redesign the format
> > when adding distributed support — we just need to add the
> > orchestration layer (global training coordinator, file selector,
> > multi-file merger).
> >
> > Best,
> > Jingsong Li
> >
> > On Thu, Jun 4, 2026 at 4:00 PM wang <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > I'm really excited about this feature! A native IVF-PQ implementation
> > > sounds very valuable for Paimon, especially if we can make it lake-native
> > > and avoid external native dependencies.
> > >
> > > One question I have is whether the current proposal is mainly focused on
> > > introducing a local vector index library, or whether it is also intended to
> > > lay the groundwork for a more distributed vector index framework in Paimon,
> > > similar to Lance's approach.
> > >
> > > If we want to support large-scale distributed vector indexes in the future,
> > > I think there are several design questions that may affect the index file
> > > format, metadata, and build/query workflow:
> > >
> > >    1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids, and
> > >    PQ codebooks? For example, in Lance, global IVF centroids are trained from
> > >    global samples before workers build the actual index files, and PQ
> > >    codebooks are also generated before distributed index writing.
> > >    2. How will distributed index building work? Will each worker write
> > >    independent index files or index segments, and then commit them through a
> > >    global index manifest/catalog? Do we need to merge these files physically,
> > >    or is logical merging through metadata enough?
> > >    3. How will query routing work in a distributed setup? For example,
> > >    after finding the top nprobe global IVF lists for a query, how do we map
> > >    these list IDs to physical index files or workers?
> > >    4. Do we plan to support incremental index build and index compaction?
> > >
> > > I don't think all of these need to be solved in the first version, but it
> > > would be great if the proposal could clarify the long-term direction. My
> > > main concern is that these distributed/snapshot-aware requirements may
> > > influence the index library abstraction and file format from the beginning.
> > >
> > > On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li <[email protected]> wrote:
> > >
> > > > Hi devs,
> > > >
> > > > I'd like to propose adding a native IVF-PQ (Inverted File with Product
> > > > Quantization) vector index to Paimon. I have a working prototype and
> > > > would like to get community feedback on the direction before opening a
> > > > formal proposal.
> > > >
> > > > ## Motivation
> > > >
> > > > Paimon currently supports vector search via Lumina (DiskANN), which
> > > > depends on an external native library (`lumina-jni` from Alibaba
> > > > Cloud). While DiskANN is a great algorithm, the external dependency
> > > > brings several challenges:
> > > >
> > > > 1. **Availability**: The library is hosted on a private Maven
> > > > repository, not on Maven Central. This creates a supply chain risk for
> > > > the open-source project.
> > > > 2. **Single algorithm**: DiskANN is graph-based and has its strengths,
> > > > but IVF-PQ is the industry standard for large-scale ANN with good
> > > > accuracy-speed-memory trade-offs, especially when combined with OPQ.
> > > >
> > > > ## Why not just use Faiss?
> > > >
> > > > Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> > > > architecture:
> > > > 1. **No InputStream abstraction**: Faiss's `read_index` /
> > > > `write_index` operates on local files or memory buffers. There is no
> > > > way to plug in a custom I/O layer. In Paimon, index files live on
> > > > object storage (S3/HDFS/OSS) and are accessed via
> > > > `SeekableInputStream`. To use Faiss, we would have to download the
> > > > entire index file to local disk first — defeating the purpose of a
> > > > data lake architecture.
> > > > 2. **JNI callback is impractical**: Even wrapping Faiss via JNI,
> > > > Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> > > > Redirecting these to Java's `SeekableInputStream` would require
> > > > intercepting every low-level C read call, which is fragile and has
> > > > high per-call overhead. Our Rust implementation uses a `SeekRead`
> > > > trait that maps cleanly to JNI callbacks at the inverted-list
> > > > granularity (a few bulk reads per query, not thousands of small
> > > > reads).
> > > > 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a
> > > > large C++ framework (~200K lines) with GPU support, multi-index
> > > > quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ
> > > > search path. A focused Rust implementation covers this in ~3K lines,
> > > > is easier to maintain, and has zero system library requirements (no
> > > > BLAS, LAPACK, or OpenMP).
> > > > 4. **Cross-language consistency**: Faiss has separate C++ and Python
> > > > interfaces but no Java interface. With Rust, one codebase produces
> > > > both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings,
> > > > guaranteeing identical behavior and file format compatibility.
> > > >
> > > > Our implementation aligns with Faiss's algorithmic details (ADC,
> > > > precomputed tables, OPQ, k-means++ empty cluster handling) so the
> > > > search quality is equivalent — we just own the I/O layer.
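
For readers less familiar with ADC (asymmetric distance computation) and its precomputed tables, a minimal sketch of the standard technique follows. It assumes squared-L2 distance, one byte per sub-quantizer code, and a `query` that is already the residual against the probed centroid. The names are hypothetical, and this shows the textbook method rather than the prototype's Rust code.

```java
// Hypothetical illustration only: textbook ADC, not the prototype's implementation.
final class AdcSketch {

    /**
     * table[s][c] = squared L2 distance between the s-th slice of the query and
     * centroid c of sub-quantizer s. codebooks[s][c] has length query.length / M.
     */
    static float[][] buildTable(float[][][] codebooks, float[] query) {
        int m = codebooks.length;
        int dsub = query.length / m;
        float[][] table = new float[m][];
        for (int s = 0; s < m; s++) {
            table[s] = new float[codebooks[s].length];
            for (int c = 0; c < codebooks[s].length; c++) {
                float d = 0f;
                for (int j = 0; j < dsub; j++) {
                    float diff = query[s * dsub + j] - codebooks[s][c][j];
                    d += diff * diff;
                }
                table[s][c] = d;
            }
        }
        return table;
    }

    /** Approximate distance to one encoded vector: M table lookups and adds. */
    static float distance(float[][] table, byte[] code) {
        float d = 0f;
        for (int s = 0; s < code.length; s++) {
            d += table[s][code[s] & 0xFF]; // treat the code byte as unsigned
        }
        return d;
    }

    public static void main(String[] args) {
        float[][][] codebooks = {{{0f}, {1f}}, {{0f}, {2f}}};
        float[][] table = buildTable(codebooks, new float[] {1f, 2f});
        System.out.println(distance(table, new byte[] {0, 0}));
    }
}
```

The table is built once per query (per probed list when residuals are used), after which scanning an inverted list touches only the compact codes.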
> > > >
> > > > ## Proposal
> > > >
> > > > Build a **pure Rust IVF-PQ** implementation inside the Paimon
> > > > repository, with JNI bindings for Java and PyO3 bindings for Python.
> > > > Key design goals:
> > > >
> > > > - **SeekableInputStream native**: The file format is designed for
> > > > remote storage — offset table precedes inverted list data, so queries
> > > > only read `nprobe` lists via seek+read, not the entire file.
> > > > - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> > > > tables, residual encoding, k-means++ with Faiss-style empty cluster
> > > > handling, OPQ rotation via Procrustes+SVD.
> > > > - **Zero external dependency**: The core library uses only pure Rust
> > > > crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> > > > LAPACK, or OpenMP required.
> > > > - **One codebase, two languages**: A single Rust core serves both the
> > > > Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent
> > > > behavior.
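
To illustrate the "offset table precedes inverted list data" goal, here is a toy layout, assuming a header of `[nlist:int]` followed by `nlist + 1` absolute `long` offsets and then the list bytes. This layout is invented for the example and is not the proposed file format; a `byte[]` stands in for a `SeekableInputStream`.

```java
import java.nio.ByteBuffer;

// Hypothetical layout: [nlist:int][offset:long x (nlist + 1)][list bytes...]
final class OffsetTableSketch {

    static byte[] build(byte[][] lists) {
        int n = lists.length;
        int headerSize = 4 + 8 * (n + 1);
        int dataSize = 0;
        for (byte[] l : lists) {
            dataSize += l.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
        buf.putInt(n);
        long offset = headerSize;
        for (byte[] l : lists) {
            buf.putLong(offset);
            offset += l.length;
        }
        buf.putLong(offset); // end sentinel so every list has a start and an end
        for (byte[] l : lists) {
            buf.put(l);
        }
        return buf.array();
    }

    /** Reads one list with two offset lookups and one ranged read ("seek + read"). */
    static byte[] readList(byte[] file, int listId) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int entry = 4 + 8 * listId;
        long start = buf.getLong(entry);
        long end = buf.getLong(entry + 8);
        byte[] out = new byte[(int) (end - start)];
        buf.position((int) start);
        buf.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] file = build(new byte[][] {{1, 2}, {}, {3, 4, 5}});
        System.out.println(readList(file, 2).length);
    }
}
```

On object storage the two steps in `readList` would become a small header read (cacheable across queries) plus one ranged GET per probed list.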
> > > >
> > > > We need to create a new `paimon-vector-index` sub-project repository.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> >
>
