Hi Jingsong,

I'm really excited about this feature! A native IVF-PQ implementation sounds very valuable for Paimon, especially if we can make it lake-native and avoid external native dependencies.

One question I have is whether the current proposal is mainly focused on introducing a local vector index library, or whether it is also intended to lay the groundwork for a more distributed vector index framework in Paimon, similar to Lance's approach.

If we want to support large-scale distributed vector indexes in the future, I think there are several design questions that may affect the index file format, metadata, and build/query workflow:

1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids, and PQ codebooks? For example, in Lance, global IVF centroids are trained from global samples before workers build the actual index files, and PQ codebooks are also generated before distributed index writing.
2. How will distributed index building work? Will each worker write independent index files or index segments, and then commit them through a global index manifest/catalog? Do we need to merge these files physically, or is logical merging through metadata enough?
3. How will query routing work in a distributed setup? For example, after finding the top nprobe global IVF lists for a query, how do we map these list IDs to physical index files or workers?
4. Do we plan to support incremental index build and index compaction?

I don't think all of these need to be solved in the first version, but it would be great if the proposal could clarify the long-term direction. My main concern is that these distributed/snapshot-aware requirements may influence the index library abstraction and file format from the beginning.

On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li wrote:

> Hi devs,
>
> I'd like to propose adding a native IVF-PQ (Inverted File with Product
> Quantization) vector index to Paimon. I have a working prototype and
> would like to get community feedback on the direction before opening a
> formal proposal.
>
> ## Motivation
>
> Paimon currently supports vector search via Lumina (DiskANN), which
> depends on an external native library (`lumina-jni` from Alibaba
> Cloud). While DiskANN is a great algorithm, the external dependency
> brings several challenges:
>
> 1. **Availability**: The library is hosted on a private Maven
> repository, not on Maven Central. This creates a supply chain risk for
> the open-source project.
> 2. **Single algorithm**: DiskANN is graph-based and has its strengths,
> but IVF-PQ is the industry standard for large-scale ANN with good
> accuracy-speed-memory trade-offs, especially when combined with OPQ.
>
> ## Why not just use Faiss?
>
> Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> architecture:
>
> 1. **No InputStream abstraction**: Faiss's `read_index` /
> `write_index` operates on local files or memory buffers. There is no
> way to plug in a custom I/O layer. In Paimon, index files live on
> object storage (S3/HDFS/OSS) and are accessed via
> `SeekableInputStream`. To use Faiss, we would have to download the
> entire index file to local disk first, defeating the purpose of a
> data lake architecture.
> 2. **JNI callback is impractical**: Even wrapping Faiss via JNI,
> Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> Redirecting these to Java's `SeekableInputStream` would require
> intercepting every low-level C read call, which is fragile and has
> high per-call overhead. Our Rust implementation uses a `SeekRead`
> trait that maps cleanly to JNI callbacks at the inverted-list
> granularity (a few bulk reads per query, not thousands of small
> reads).
> 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a
> large C++ framework (~200K lines) with GPU support, multi-index
> quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ
> search path.
> A focused Rust implementation covers this in ~3K lines,
> is easier to maintain, and has zero system library requirements (no
> BLAS, LAPACK, or OpenMP).
> 4. **Cross-language consistency**: Faiss has separate C++ and Python
> interfaces but no Java interface. With Rust, one codebase produces
> both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings,
> guaranteeing identical behavior and file format compatibility.
>
> Our implementation aligns with Faiss's algorithmic details (ADC,
> precomputed tables, OPQ, k-means++ empty cluster handling) so the
> search quality is equivalent; we just own the I/O layer.
>
> ## Proposal
>
> Build a **pure Rust IVF-PQ** implementation inside the Paimon
> repository, with JNI bindings for Java and PyO3 bindings for Python.
> Key design goals:
>
> - **SeekableInputStream native**: The file format is designed for
> remote storage: the offset table precedes inverted list data, so queries
> only read `nprobe` lists via seek+read, not the entire file.
> - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> tables, residual encoding, k-means++ with Faiss-style empty cluster
> handling, OPQ rotation via Procrustes+SVD.
> - **Zero external dependency**: The core library uses only pure Rust
> crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> LAPACK, or OpenMP required.
> - **One codebase, two languages**: A single Rust core serves both the
> Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent
> behavior.
>
> We need to create a new `paimon-vector-index` sub-project repository.
>
> What do you think?
>
> Best,
> Jingsong
>
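
P.S. To make question 3 above a bit more concrete, here is a minimal Python sketch of one possible routing scheme. It is only an illustration under stated assumptions: coarse centroids are trained globally, each worker writes its own index file, and a manifest records which files hold segments of each global IVF list. `ListLocation`, `manifest`, `top_nprobe_lists`, and `plan_reads` are hypothetical names that I made up for this sketch; they are not taken from the prototype or from Lance, and the file names and byte ranges are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ListLocation:
    """Where one worker stored its part of a global IVF list (hypothetical)."""
    index_file: str  # path of the index file written by one worker
    offset: int      # byte offset of the list segment inside that file
    length: int      # byte length of the list segment


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_nprobe_lists(query, centroids, nprobe):
    """Return the IDs of the nprobe global centroids closest to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda i: squared_l2(query, centroids[i]))
    return ranked[:nprobe]


def plan_reads(list_ids, manifest):
    """Group the byte ranges to read by physical index file.

    `manifest` maps a global list ID to every segment that holds part of it,
    so files written by different workers are merged logically (through
    metadata) rather than physically.
    """
    plan = defaultdict(list)
    for list_id in list_ids:
        for loc in manifest.get(list_id, []):
            plan[loc.index_file].append((list_id, loc.offset, loc.length))
    return dict(plan)


# Toy data: three global centroids, two worker-written index files.
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
manifest = {
    0: [ListLocation("worker-a.idx", 0, 128), ListLocation("worker-b.idx", 0, 64)],
    1: [ListLocation("worker-a.idx", 128, 256)],
    2: [ListLocation("worker-b.idx", 64, 32)],
}

lists = top_nprobe_lists([2.0, 1.0], centroids, nprobe=2)
plan = plan_reads(lists, manifest)
```

In this toy setup, the probed lists fan out to seek+read ranges in both files, because list 0 has a segment in each. If something like this is the intended direction, logical merging through metadata might be enough for reads, but the manifest layout and the list-ID-to-segment mapping would then become part of the format contract, which is why I'd like to understand the plan early.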
