Hi Jingsong,

Strong +1 from my side. This would be valuable and help remove the external lumina-jni dependency. Happy to support with dev efforts and review.

Warm Regards,
Arnav

On Thu, Jun 4, 2026 at 1:54 PM wang <[email protected]> wrote:

> Hi Jingsong,
>
> Thanks for the detailed explanation! This makes a lot of sense to me.
>
> Looking forward to this feature! I'm very willing to contribute on this!
>
> Best,
> Xinyu Wang
>
>
> On Thu, Jun 4, 2026 at 4:11 PM Jingsong Li <[email protected]> wrote:
>
> > Hi Wang,
> >
> > Thanks for the excellent questions! These are exactly the right things
> > to think about early. Let me address each one.
> >
> > The short answer is: the first version focuses on a per-data-file
> > local index, but the design intentionally separates the "training
> > artifacts" (centroids, codebooks, OPQ rotation) from the "index data"
> > (inverted lists), which is the key prerequisite for distributed
> > indexing.
> >
> > ## Q1: Will IVF-PQ use globally trained centroids/codebooks?
> >
> > Yes, this is the intended direction. The architecture naturally
> > supports a two-phase approach:
> >
> > **Phase 1 (current):** Training and indexing happen together per data
> > file. Each file trains its own centroids and PQ codebooks. This is
> > simple and already works.
> >
> > **Phase 2 (future):** Separate training from indexing:
> >
> > ```
> > Global Training (once per table/partition):
> >   Sample vectors from multiple files → train centroids + PQ codebooks + OPQ rotation
> >   → Store as a shared "training artifact" file
> >
> > Distributed Build (per worker):
> >   Load shared training artifacts
> >   → Assign vectors to global IVF lists
> >   → Encode residuals with shared PQ codebooks
> >   → Write index file (only contains inverted list data, references shared artifacts)
> > ```
> >
> > The file format already supports this separation: centroids and
> > codebooks are stored in a well-defined section of the file.
> > In Phase 2, we can either:
> > - Embed the shared artifacts in each index file (simple, some redundancy)
> > - Or reference an external artifact file via metadata (no redundancy,
> >   slightly more complex)
> >
> > The `IVFPQIndexMeta` (stored in `GlobalIndexIOMeta.metadata()`) can be
> > extended with a `training_artifact_path` field without breaking
> > backward compatibility.
> >
> > ## Q2: How will distributed index building work?
> >
> > Paimon's existing architecture already provides the distributed
> > skeleton. The key insight is that IVF-PQ's inverted lists are
> > **partitionable by list ID**:
> >
> > ```
> > Global IVF lists: [0, 1, 2, ..., nlist-1]
> >
> > Worker 0 writes: index-file-0.ivfpq → contains lists {0, 3, 7, 12, ...}
> > Worker 1 writes: index-file-1.ivfpq → contains lists {1, 4, 8, 15, ...}
> > Worker 2 writes: index-file-2.ivfpq → contains lists {2, 5, 9, 11, ...}
> > ```
> >
> > Each worker writes an independent index file. **No physical merge is
> > needed**: Paimon's snapshot metadata already tracks multiple index
> > files per partition/bucket. The query layer reads the relevant files
> > based on metadata.
> >
> > This is analogous to how BTree global index already works: BTree
> > stores `firstKey`/`lastKey` in `BTreeIndexMeta` to enable file pruning
> > via `BTreeFileMetaSelector`. For IVF-PQ, we can store the **list ID
> > set** (which IVF lists this file contains) in `IVFPQIndexMeta`:
> >
> > ```
> > BTree metadata:  {firstKey, lastKey}      → prune files by key range
> > IVF-PQ metadata: {listIds: [3,7,12,...]}  → prune files by IVF list assignment
> > ```
> >
> > The `GlobalIndexIOMeta` system supports this naturally: each
> > `ResultEntry` already carries a `byte[] meta` field. We just need to
> > serialize the list ID set there.
> >
> > Logical merging through metadata is sufficient for reads.
> > Physical compaction (merging multiple index files into one) can be a
> > background optimization, similar to how Paimon compacts data files.
> >
> > ## Q3: How will query routing work?
> >
> > With the metadata described above, the query flow for distributed
> > IVF-PQ is:
> >
> > ```
> > 1. Compute top-nprobe IVF list IDs from query vector (using shared centroids)
> > 2. For each list ID, look up which index files contain that list (via metadata)
> > 3. Read only those files, only the relevant inverted lists within each file
> > 4. Merge results across files using a global top-k heap
> > ```
> >
> > Step 2 is the "file selector" step, directly analogous to
> > `BTreeFileMetaSelector`:
> >
> > ```java
> > // BTree: select files whose key range overlaps the query
> > BTreeFileMetaSelector → visitEqual(key) → filter by firstKey/lastKey
> >
> > // IVF-PQ: select files whose list IDs overlap the probed lists
> > IVFPQFileMetaSelector → visitVectorSearch(query) → filter by listIds
> > ```
> >
> > This means the query only reads the minimum number of files and the
> > minimum number of inverted lists within each file: two levels of
> > pruning.
> >
> > ## Q4: Incremental index build and compaction?
> >
> > **Incremental build:** When new data files are added, a new index file
> > is built for the new data using the shared training artifacts (same
> > centroids, same PQ codebooks). No need to rebuild existing index
> > files. The new file is registered in the snapshot metadata alongside
> > existing index files.
> >
> > **Index compaction:** Multiple small index files can be merged into
> > larger ones as a background optimization. Since inverted lists are
> > independent, this is a simple concatenation per list ID, with no
> > retraining needed.
> >
> > **Re-training:** When data distribution drifts significantly, the
> > shared training artifacts can be retrained from fresh samples and all
> > index files rebuilt.
> > This is a heavy operation but infrequent (similar to how partition
> > statistics are refreshed).
> >
> > ## Summary
> >
> > The critical point: the file format and metadata abstraction in V1
> > should accommodate the distributed case. The centroids/codebooks
> > section is clearly separated from inverted list data, and
> > `IVFPQIndexMeta` is extensible. We don't need to redesign the format
> > when adding distributed support; we just need to add the
> > orchestration layer (global training coordinator, file selector,
> > multi-file merger).
> >
> > Best,
> > Jingsong Li
> >
> > On Thu, Jun 4, 2026 at 4:00 PM wang <[email protected]> wrote:
> > >
> > > Hi Jingsong,
> > >
> > > I'm really excited about this feature! A native IVF-PQ implementation
> > > sounds very valuable for Paimon, especially if we can make it
> > > lake-native and avoid external native dependencies.
> > >
> > > One question I have is whether the current proposal is mainly focused
> > > on introducing a local vector index library, or whether it is also
> > > intended to lay the groundwork for a more distributed vector index
> > > framework in Paimon, similar to Lance's approach.
> > >
> > > If we want to support large-scale distributed vector indexes in the
> > > future, I think there are several design questions that may affect
> > > the index file format, metadata, and build/query workflow:
> > >
> > > 1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids,
> > >    and PQ codebooks? For example, in Lance, global IVF centroids are
> > >    trained from global samples before workers build the actual index
> > >    files, and PQ codebooks are also generated before distributed
> > >    index writing.
> > > 2. How will distributed index building work? Will each worker write
> > >    independent index files or index segments, and then commit them
> > >    through a global index manifest/catalog?
> > >    Do we need to merge these files physically, or is logical merging
> > >    through metadata enough?
> > > 3. How will query routing work in a distributed setup? For example,
> > >    after finding the top nprobe global IVF lists for a query, how do
> > >    we map these list IDs to physical index files or workers?
> > > 4. Do we plan to support incremental index build and index
> > >    compaction?
> > >
> > > I don't think all of these need to be solved in the first version,
> > > but it would be great if the proposal could clarify the long-term
> > > direction. My main concern is that these distributed/snapshot-aware
> > > requirements may influence the index library abstraction and file
> > > format from the beginning.
> > >
> > > On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li <[email protected]> wrote:
> > > >
> > > > Hi devs,
> > > >
> > > > I'd like to propose adding a native IVF-PQ (Inverted File with
> > > > Product Quantization) vector index to Paimon. I have a working
> > > > prototype and would like to get community feedback on the direction
> > > > before opening a formal proposal.
> > > >
> > > > ## Motivation
> > > >
> > > > Paimon currently supports vector search via Lumina (DiskANN), which
> > > > depends on an external native library (`lumina-jni` from Alibaba
> > > > Cloud). While DiskANN is a great algorithm, the external dependency
> > > > brings several challenges:
> > > >
> > > > 1. **Availability**: The library is hosted on a private Maven
> > > >    repository, not on Maven Central. This creates a supply chain
> > > >    risk for the open-source project.
> > > > 2. **Single algorithm**: DiskANN is graph-based and has its
> > > >    strengths, but IVF-PQ is the industry standard for large-scale
> > > >    ANN with good accuracy-speed-memory trade-offs, especially when
> > > >    combined with OPQ.
> > > >
> > > > ## Why not just use Faiss?
> > > >
> > > > Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> > > > architecture:
> > > >
> > > > 1. **No InputStream abstraction**: Faiss's `read_index` /
> > > >    `write_index` operates on local files or memory buffers. There is
> > > >    no way to plug in a custom I/O layer. In Paimon, index files live
> > > >    on object storage (S3/HDFS/OSS) and are accessed via
> > > >    `SeekableInputStream`. To use Faiss, we would have to download
> > > >    the entire index file to local disk first, defeating the purpose
> > > >    of a data lake architecture.
> > > > 2. **JNI callback is impractical**: Even wrapping Faiss via JNI,
> > > >    Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> > > >    Redirecting these to Java's `SeekableInputStream` would require
> > > >    intercepting every low-level C read call, which is fragile and
> > > >    has high per-call overhead. Our Rust implementation uses a
> > > >    `SeekRead` trait that maps cleanly to JNI callbacks at the
> > > >    inverted-list granularity (a few bulk reads per query, not
> > > >    thousands of small reads).
> > > > 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is
> > > >    a large C++ framework (~200K lines) with GPU support, multi-index
> > > >    quantizers, polysemous codes, etc. Paimon only needs the core
> > > >    IVF-PQ search path. A focused Rust implementation covers this in
> > > >    ~3K lines, is easier to maintain, and has zero system library
> > > >    requirements (no BLAS, LAPACK, or OpenMP).
> > > > 4. **Cross-language consistency**: Faiss has separate C++ and Python
> > > >    interfaces but no Java interface. With Rust, one codebase
> > > >    produces both JNI (for Paimon Java engine) and PyO3 (for
> > > >    PyPaimon) bindings, guaranteeing identical behavior and file
> > > >    format compatibility.
> > > >
> > > > Our implementation aligns with Faiss's algorithmic details (ADC,
> > > > precomputed tables, OPQ, k-means++ empty cluster handling) so the
> > > > search quality is equivalent; we just own the I/O layer.
> > > >
> > > > ## Proposal
> > > >
> > > > Build a **pure Rust IVF-PQ** implementation inside the Paimon
> > > > repository, with JNI bindings for Java and PyO3 bindings for Python.
> > > > Key design goals:
> > > >
> > > > - **SeekableInputStream native**: The file format is designed for
> > > >   remote storage: the offset table precedes inverted list data, so
> > > >   queries only read `nprobe` lists via seek+read, not the entire
> > > >   file.
> > > > - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> > > >   tables, residual encoding, k-means++ with Faiss-style empty
> > > >   cluster handling, OPQ rotation via Procrustes+SVD.
> > > > - **Zero external dependency**: The core library uses only pure Rust
> > > >   crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> > > >   LAPACK, or OpenMP required.
> > > > - **One codebase, two languages**: A single Rust core serves both
> > > >   the Java engine (via JNI) and PyPaimon (via PyO3), ensuring
> > > >   consistent behavior.
> > > >
> > > > We need to create a new `paimon-vector-index` sub-project repository.
> > > >
> > > > What do you think?
> > > >
> > > > Best,
> > > > Jingsong
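
For readers following the thread, here is a minimal sketch of the Q3 routing flow quoted above: probe the nearest lists, select index files by their list-ID metadata, then merge hits with a global top-k heap. It is illustration only. `IvfPqRoutingSketch`, `FileMeta`, `Hit`, `probeLists`, `selectFiles` and `mergeTopK` are hypothetical names, not Paimon APIs and not the proposed `IVFPQFileMetaSelector`. Squared L2 distance is an assumption, and step 3 (reading the inverted lists from each file) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Illustrative sketch only: none of these names are real Paimon APIs.
final class IvfPqRoutingSketch {

    // Hypothetical per-file metadata: the IVF list IDs stored in one index file.
    static final class FileMeta {
        final String fileName;
        final Set<Integer> listIds;
        FileMeta(String fileName, Set<Integer> listIds) {
            this.fileName = fileName;
            this.listIds = listIds;
        }
    }

    // Hypothetical search hit: a row ID and its approximate distance to the query.
    static final class Hit {
        final long rowId;
        final float distance;
        Hit(long rowId, float distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    static double squaredL2(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Step 1: IDs of the nprobe centroids closest to the query (assumed squared L2).
    static List<Integer> probeLists(float[] query, float[][] centroids, int nprobe) {
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredL2(query, centroids[i])));
        int n = Math.min(nprobe, order.length);
        return new ArrayList<>(Arrays.asList(order).subList(0, n));
    }

    // Step 2: keep only files whose list-ID set overlaps the probed lists.
    static List<FileMeta> selectFiles(List<FileMeta> files, Collection<Integer> probed) {
        List<FileMeta> selected = new ArrayList<>();
        for (FileMeta f : files) {
            if (!Collections.disjoint(f.listIds, probed)) {
                selected.add(f);
            }
        }
        return selected;
    }

    // Step 4: merge per-file hits into a global top-k with a bounded max-heap.
    static List<Hit> mergeTopK(List<List<Hit>> perFileHits, int k) {
        Comparator<Hit> byDistance = Comparator.comparingDouble(h -> h.distance);
        PriorityQueue<Hit> heap = new PriorityQueue<>(byDistance.reversed());
        for (List<Hit> hits : perFileHits) {
            for (Hit h : hits) {
                heap.add(h);
                if (heap.size() > k) {
                    heap.poll(); // drop the current worst (largest-distance) hit
                }
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(byDistance); // closest first
        return result;
    }

    public static void main(String[] args) {
        float[][] centroids = {{0f, 0f}, {10f, 10f}, {1f, 1f}};
        List<FileMeta> files = Arrays.asList(
                new FileMeta("index-file-0.ivfpq", new HashSet<>(Arrays.asList(0, 3))),
                new FileMeta("index-file-1.ivfpq", new HashSet<>(Arrays.asList(1, 4))));
        List<Integer> probed = probeLists(new float[] {0.2f, 0.1f}, centroids, 1);
        for (FileMeta f : selectFiles(files, probed)) {
            System.out.println(f.fileName);
        }
    }
}
```

The overlap test in `selectFiles` is the part that corresponds to the list-ID pruning described in Q2 and Q3. A real selector would presumably deserialize the list-ID set from the per-file metadata bytes rather than receive it as a `Set<Integer>`.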
