Hi Jingsong,

Thanks for the detailed explanation! This makes a lot of sense to me.
Looking forward to this feature! I'd be glad to contribute to it!

Best,
Xinyu Wang

On Thu, Jun 4, 2026 at 4:11 PM Jingsong Li <[email protected]> wrote:

> Hi Wang,
>
> Thanks for the excellent questions! These are exactly the right things
> to think about early. Let me address each one.
>
> The short answer is: the first version focuses on a per-data-file
> local index, but the design intentionally separates the "training
> artifacts" (centroids, codebooks, OPQ rotation) from the "index data"
> (inverted lists), which is the key prerequisite for distributed
> indexing.
>
> ## Q1: Will IVF-PQ use globally trained centroids/codebooks?
>
> Yes, this is the intended direction. The architecture naturally
> supports a two-phase approach:
>
> **Phase 1 (current):** Training and indexing happen together per data
> file. Each file trains its own centroids and PQ codebooks. This is
> simple and already works.
>
> **Phase 2 (future):** Separate training from indexing:
>
> ```
> Global Training (once per table/partition):
>   Sample vectors from multiple files → train centroids + PQ codebooks + OPQ rotation
>   → Store as a shared "training artifact" file
>
> Distributed Build (per worker):
>   Load shared training artifacts
>   → Assign vectors to global IVF lists
>   → Encode residuals with shared PQ codebooks
>   → Write index file (only contains inverted list data, references shared artifacts)
> ```
>
> The file format already supports this separation: centroids and
> codebooks are stored in a well-defined section of the file. In Phase
> 2, we can either:
> - Embed the shared artifacts in each index file (simple, some redundancy)
> - Or reference an external artifact file via metadata (no redundancy,
>   slightly more complex)
>
> The `IVFPQIndexMeta` (stored in `GlobalIndexIOMeta.metadata()`) can be
> extended with a `training_artifact_path` field without breaking
> backward compatibility.
>
> ## Q2: How will distributed index building work?
>
> Paimon's existing architecture already provides the distributed
> skeleton. The key insight is that IVF-PQ's inverted lists are
> **partitionable by list ID**:
>
> ```
> Global IVF lists: [0, 1, 2, ..., nlist-1]
>
> Worker 0 writes: index-file-0.ivfpq → contains lists {0, 3, 7, 12, ...}
> Worker 1 writes: index-file-1.ivfpq → contains lists {1, 4, 8, 15, ...}
> Worker 2 writes: index-file-2.ivfpq → contains lists {2, 5, 9, 11, ...}
> ```
>
> Each worker writes an independent index file. **No physical merge is
> needed**: Paimon's snapshot metadata already tracks multiple index
> files per partition/bucket. The query layer reads the relevant files
> based on metadata.
>
> This is analogous to how the BTree global index already works: BTree
> stores `firstKey`/`lastKey` in `BTreeIndexMeta` to enable file pruning
> via `BTreeFileMetaSelector`. For IVF-PQ, we can store the **list ID
> set** (which IVF lists this file contains) in `IVFPQIndexMeta`:
>
> ```
> BTree metadata:  {firstKey, lastKey}      → prune files by key range
> IVF-PQ metadata: {listIds: [3,7,12,...]}  → prune files by IVF list assignment
> ```
>
> The `GlobalIndexIOMeta` system supports this naturally: each
> `ResultEntry` already carries a `byte[] meta` field. We just need to
> serialize the list ID set there.
>
> Logical merging through metadata is sufficient for reads. Physical
> compaction (merging multiple index files into one) can be a background
> optimization, similar to how Paimon compacts data files.
>
> ## Q3: How will query routing work?
>
> With the metadata described above, the query flow for distributed IVF-PQ
> is:
>
> ```
> 1. Compute top-nprobe IVF list IDs from query vector (using shared centroids)
> 2. For each list ID, look up which index files contain that list (via metadata)
> 3. Read only those files, only the relevant inverted lists within each file
> 4. Merge results across files using a global top-k heap
> ```
>
> Step 2 is the "file selector" step, directly analogous to
> `BTreeFileMetaSelector`:
>
> ```java
> // BTree: select files whose key range overlaps the query
> BTreeFileMetaSelector → visitEqual(key) → filter by firstKey/lastKey
>
> // IVF-PQ: select files whose list IDs overlap the probed lists
> IVFPQFileMetaSelector → visitVectorSearch(query) → filter by listIds
> ```
>
> This means the query only reads the minimum number of files and the
> minimum number of inverted lists within each file: two levels of
> pruning.
>
> ## Q4: Incremental index build and compaction?
>
> **Incremental build:** When new data files are added, a new index file
> is built for the new data using the shared training artifacts (same
> centroids, same PQ codebooks). No need to rebuild existing index
> files. The new file is registered in the snapshot metadata alongside
> existing index files.
>
> **Index compaction:** Multiple small index files can be merged into
> larger ones as a background optimization. Since inverted lists are
> independent, this is a simple concatenation per list ID, with no
> retraining needed.
>
> **Re-training:** When data distribution drifts significantly, the
> shared training artifacts can be retrained from fresh samples and all
> index files rebuilt. This is a heavy operation but infrequent (similar
> to how partition statistics are refreshed).
>
> ## Summary
>
> The critical point: the file format and metadata abstraction in V1
> should accommodate the distributed case. The centroids/codebooks
> section is clearly separated from inverted list data, and
> `IVFPQIndexMeta` is extensible. We don't need to redesign the format
> when adding distributed support; we just need to add the
> orchestration layer (global training coordinator, file selector,
> multi-file merger).
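
To make the Q3 query flow concrete, here is a minimal Java sketch of steps 1, 2 and 4. It is only an illustration under stated assumptions: every name in it (`IvfPqRoutingSketch`, `Candidate`, `probeLists`, `selectFiles`, `mergeTopK`) is hypothetical and not part of Paimon, the list-ID sets are assumed to have already been decoded from the per-file metadata, and step 3 (reading the inverted lists) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical sketch of distributed IVF-PQ query routing; not Paimon API. */
final class IvfPqRoutingSketch {

    /** One search hit: a row id and its (approximate) distance to the query. */
    static final class Candidate {
        final long rowId;
        final float distance;

        Candidate(long rowId, float distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    /** Step 1: ids of the nprobe centroids closest to the query (squared L2). */
    static int[] probeLists(float[][] centroids, float[] query, int nprobe) {
        Integer[] ids = new Integer[centroids.length];
        float[] dist = new float[centroids.length];
        for (int i = 0; i < centroids.length; i++) {
            ids[i] = i;
            for (int d = 0; d < query.length; d++) {
                float diff = centroids[i][d] - query[d];
                dist[i] += diff * diff;
            }
        }
        // Sort list ids by distance; the sort is stable, so ties keep id order.
        Arrays.sort(ids, Comparator.comparingDouble(i -> dist[i]));
        int n = Math.min(nprobe, ids.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = ids[i];
        }
        return out;
    }

    /** Step 2: keep only files whose list-id set overlaps the probed lists. */
    static List<String> selectFiles(Map<String, Set<Integer>> listIdsByFile, int[] probed) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> e : listIdsByFile.entrySet()) {
            for (int listId : probed) {
                if (e.getValue().contains(listId)) {
                    selected.add(e.getKey());
                    break; // one overlapping list is enough to keep the file
                }
            }
        }
        return selected;
    }

    /** Step 4: merge per-file candidates with a bounded max-heap of size k. */
    static List<Candidate> mergeTopK(List<List<Candidate>> perFile, int k) {
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
        for (List<Candidate> hits : perFile) {
            for (Candidate c : hits) {
                heap.offer(c);
                if (heap.size() > k) {
                    heap.poll(); // drop the current worst candidate
                }
            }
        }
        List<Candidate> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(c -> c.distance));
        return result;
    }
}
```

With the three-worker layout from Q2, probing lists `[2, 0]` would select `index-file-0.ivfpq` and `index-file-2.ivfpq` and skip `index-file-1.ivfpq`. A real selector would presumably work on serialized metadata rather than an in-memory map.
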
>
> Best,
> Jingsong Li
>
> On Thu, Jun 4, 2026 at 4:00 PM wang <[email protected]> wrote:
> >
> > Hi Jingsong,
> >
> > I'm really excited about this feature! A native IVF-PQ implementation
> > sounds very valuable for Paimon, especially if we can make it lake-native
> > and avoid external native dependencies.
> >
> > One question I have is whether the current proposal is mainly focused on
> > introducing a local vector index library, or whether it is also intended
> > to lay the groundwork for a more distributed vector index framework in
> > Paimon, similar to Lance's approach.
> >
> > If we want to support large-scale distributed vector indexes in the
> > future, I think there are several design questions that may affect the
> > index file format, metadata, and build/query workflow:
> >
> > 1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids, and
> >    PQ codebooks? For example, in Lance, global IVF centroids are trained
> >    from global samples before workers build the actual index files, and
> >    PQ codebooks are also generated before distributed index writing.
> > 2. How will distributed index building work? Will each worker write
> >    independent index files or index segments, and then commit them
> >    through a global index manifest/catalog? Do we need to merge these
> >    files physically, or is logical merging through metadata enough?
> > 3. How will query routing work in a distributed setup? For example,
> >    after finding the top nprobe global IVF lists for a query, how do we
> >    map these list IDs to physical index files or workers?
> > 4. Do we plan to support incremental index build and index compaction?
> >
> > I don't think all of these need to be solved in the first version, but it
> > would be great if the proposal could clarify the long-term direction. My
> > main concern is that these distributed/snapshot-aware requirements may
> > influence the index library abstraction and file format from the
> > beginning.
> >
> > On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li <[email protected]> wrote:
> > >
> > > Hi devs,
> > >
> > > I'd like to propose adding a native IVF-PQ (Inverted File with Product
> > > Quantization) vector index to Paimon. I have a working prototype and
> > > would like to get community feedback on the direction before opening a
> > > formal proposal.
> > >
> > > ## Motivation
> > >
> > > Paimon currently supports vector search via Lumina (DiskANN), which
> > > depends on an external native library (`lumina-jni` from Alibaba
> > > Cloud). While DiskANN is a great algorithm, the external dependency
> > > brings several challenges:
> > >
> > > 1. **Availability**: The library is hosted on a private Maven
> > > repository, not on Maven Central. This creates a supply chain risk for
> > > the open-source project.
> > > 2. **Single algorithm**: DiskANN is graph-based and has its strengths,
> > > but IVF-PQ is the industry standard for large-scale ANN with good
> > > accuracy-speed-memory trade-offs, especially when combined with OPQ.
> > >
> > > ## Why not just use Faiss?
> > >
> > > Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> > > architecture:
> > > 1. **No InputStream abstraction**: Faiss's `read_index` /
> > > `write_index` operate on local files or memory buffers. There is no
> > > way to plug in a custom I/O layer. In Paimon, index files live on
> > > object storage (S3/HDFS/OSS) and are accessed via
> > > `SeekableInputStream`. To use Faiss, we would have to download the
> > > entire index file to local disk first, defeating the purpose of a
> > > data lake architecture.
> > > 2. **JNI callback is impractical**: Even wrapping Faiss via JNI,
> > > Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> > > Redirecting these to Java's `SeekableInputStream` would require
> > > intercepting every low-level C read call, which is fragile and has
> > > high per-call overhead. Our Rust implementation uses a `SeekRead`
> > > trait that maps cleanly to JNI callbacks at the inverted-list
> > > granularity (a few bulk reads per query, not thousands of small
> > > reads).
> > > 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a
> > > large C++ framework (~200K lines) with GPU support, multi-index
> > > quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ
> > > search path. A focused Rust implementation covers this in ~3K lines,
> > > is easier to maintain, and has zero system library requirements (no
> > > BLAS, LAPACK, or OpenMP).
> > > 4. **Cross-language consistency**: Faiss has separate C++ and Python
> > > interfaces but no Java interface. With Rust, one codebase produces
> > > both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings,
> > > guaranteeing identical behavior and file format compatibility.
> > >
> > > Our implementation aligns with Faiss's algorithmic details (ADC,
> > > precomputed tables, OPQ, k-means++ empty cluster handling) so the
> > > search quality is equivalent; we just own the I/O layer.
> > >
> > > ## Proposal
> > >
> > > Build a **pure Rust IVF-PQ** implementation inside the Paimon
> > > repository, with JNI bindings for Java and PyO3 bindings for Python.
> > > Key design goals:
> > >
> > > - **SeekableInputStream native**: The file format is designed for
> > > remote storage: the offset table precedes inverted list data, so
> > > queries only read `nprobe` lists via seek+read, not the entire file.
> > > - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> > > tables, residual encoding, k-means++ with Faiss-style empty cluster
> > > handling, OPQ rotation via Procrustes+SVD.
> > > - **Zero external dependencies**: The core library uses only pure Rust
> > > crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> > > LAPACK, or OpenMP required.
> > > - **One codebase, two languages**: A single Rust core serves both the
> > > Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent
> > > behavior.
> > >
> > > We need to create a new `paimon-vector-index` sub-project repository.
> > >
> > > What do you think?
> > >
> > > Best,
> > > Jingsong
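
The "offset table precedes inverted list data" design goal from the original proposal can be illustrated with a small Java sketch. The byte layout used here (`[nlist:int][nlist+1 offsets:long][list bytes]`) and the names `OffsetTableSketch`, `write` and `readList` are assumptions made up for this example; the thread does not specify the actual file format. An in-memory `byte[]` stands in for a remote file, so the final range copy corresponds to one seek plus one bulk read.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Hypothetical layout: [nlist:int][offsets:(nlist+1) longs][list bytes]; not the real format. */
final class OffsetTableSketch {

    /** Serializes inverted lists with the offset table placed before the list data. */
    static byte[] write(byte[][] lists) {
        int headerSize = Integer.BYTES + (lists.length + 1) * Long.BYTES;
        int dataSize = 0;
        for (byte[] list : lists) {
            dataSize += list.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(headerSize + dataSize);
        buf.putInt(lists.length);
        long offset = headerSize;
        for (byte[] list : lists) {
            buf.putLong(offset);
            offset += list.length;
        }
        buf.putLong(offset); // end sentinel: list i spans [offsets[i], offsets[i + 1])
        for (byte[] list : lists) {
            buf.put(list);
        }
        return buf.array();
    }

    /** Reads one list: look up its two offsets, then copy only that byte range. */
    static byte[] readList(byte[] file, int listId) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int nlist = buf.getInt(0);
        if (listId < 0 || listId >= nlist) {
            throw new IllegalArgumentException("bad list id: " + listId);
        }
        int entry = Integer.BYTES + listId * Long.BYTES;
        long start = buf.getLong(entry);
        long end = buf.getLong(entry + Long.BYTES);
        // On remote storage this would be seek(start) followed by one read of (end - start) bytes.
        return Arrays.copyOfRange(file, (int) start, (int) end);
    }
}
```

Because the offsets sit at fixed positions near the start of the file, a reader needs only the small header plus the `nprobe` selected byte ranges, which is the property the proposal relies on for object storage.
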
