Hi devs,

I'd like to propose adding a native IVF-PQ (Inverted File with Product Quantization) vector index to Paimon. I have a working prototype and would like to get community feedback on the direction before opening a formal proposal.

## Motivation

Paimon currently supports vector search via Lumina (DiskANN), which depends on an external native library (`lumina-jni` from Alibaba Cloud). While DiskANN is a great algorithm, the external dependency brings several challenges:

1. **Availability**: The library is hosted on a private Maven repository, not on Maven Central. This creates a supply chain risk for the open-source project.
2. **Single algorithm**: DiskANN is graph-based and has its strengths, but IVF-PQ is the industry standard for large-scale ANN with good accuracy-speed-memory trade-offs, especially when combined with OPQ.

## Why not just use Faiss?

Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's architecture:

1. **No InputStream abstraction**: Faiss's `read_index` / `write_index` operates on local files or memory buffers. There is no way to plug in a custom I/O layer. In Paimon, index files live on object storage (S3/HDFS/OSS) and are accessed via `SeekableInputStream`. To use Faiss, we would have to download the entire index file to local disk first, defeating the purpose of a data lake architecture.
2. **JNI callback is impractical**: Even wrapping Faiss via JNI, Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers. Redirecting these to Java's `SeekableInputStream` would require intercepting every low-level C read call, which is fragile and has high per-call overhead. Our Rust implementation uses a `SeekRead` trait that maps cleanly to JNI callbacks at the inverted-list granularity (a few bulk reads per query, not thousands of small reads).
3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a large C++ framework (~200K lines) with GPU support, multi-index quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ search path. A focused Rust implementation covers this in ~3K lines, is easier to maintain, and has zero system library requirements (no BLAS, LAPACK, or OpenMP).
4. **Cross-language consistency**: Faiss has separate C++ and Python interfaces but no Java interface. With Rust, one codebase produces both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings, guaranteeing identical behavior and file format compatibility.

Our implementation aligns with Faiss's algorithmic details (ADC, precomputed tables, OPQ, k-means++ empty cluster handling) so the search quality is equivalent; we just own the I/O layer.

## Proposal

Build a **pure Rust IVF-PQ** implementation inside the Paimon repository, with JNI bindings for Java and PyO3 bindings for Python.

Key design goals:

- **SeekableInputStream native**: The file format is designed for remote storage. The offset table precedes inverted list data, so queries only read `nprobe` lists via seek+read, not the entire file.
- **Faiss-aligned algorithms**: Same ADC with precomputed distance tables, residual encoding, k-means++ with Faiss-style empty cluster handling, OPQ rotation via Procrustes+SVD.
- **Zero external dependency**: The core library uses only pure Rust crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS, LAPACK, or OpenMP required.
- **One codebase, two languages**: A single Rust core serves both the Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent behavior.

We need to create a new `paimon-vector-index` sub-project repository.

What do you think?

Best,
Jingsong
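
To make the "offset table precedes inverted list data" goal more concrete, here is a minimal sketch of how a query could fetch only the probed lists with one seek and one bulk read each. Everything in it is an assumption for illustration: the byte layout, `ListEntry`, `read_offset_table`, `read_probed_lists`, and `build_index` are hypothetical names, not the prototype's actual file format or API. It uses the standard `Read + Seek` traits as a stand-in for the `SeekRead` trait mentioned above.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical layout, for illustration only (not the prototype's real format):
///   [nlist: u32 LE] [nlist x (offset: u64 LE, len: u32 LE)] [inverted list bytes ...]
pub struct ListEntry {
    pub offset: u64, // absolute byte offset of the list within the file
    pub len: u32,    // byte length of the list
}

/// Read the offset table once, when the index is opened.
pub fn read_offset_table<R: Read + Seek>(r: &mut R) -> io::Result<Vec<ListEntry>> {
    r.seek(SeekFrom::Start(0))?;
    let mut b4 = [0u8; 4];
    r.read_exact(&mut b4)?;
    let nlist = u32::from_le_bytes(b4) as usize;
    let mut table = Vec::with_capacity(nlist);
    for _ in 0..nlist {
        let mut b8 = [0u8; 8];
        r.read_exact(&mut b8)?;
        r.read_exact(&mut b4)?;
        table.push(ListEntry {
            offset: u64::from_le_bytes(b8),
            len: u32::from_le_bytes(b4),
        });
    }
    Ok(table)
}

/// Fetch only the probed lists: one seek plus one bulk read per list.
pub fn read_probed_lists<R: Read + Seek>(
    r: &mut R,
    table: &[ListEntry],
    probes: &[usize],
) -> io::Result<Vec<Vec<u8>>> {
    let mut out = Vec::with_capacity(probes.len());
    for &p in probes {
        let e = table.get(p).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "list id out of range")
        })?;
        let mut buf = vec![0u8; e.len as usize];
        r.seek(SeekFrom::Start(e.offset))?;
        r.read_exact(&mut buf)?;
        out.push(buf);
    }
    Ok(out)
}

/// Serialize lists into the hypothetical layout (useful for trying the reader).
pub fn build_index(lists: &[Vec<u8>]) -> Vec<u8> {
    let header_len = 4 + lists.len() * 12; // count + one 12-byte entry per list
    let mut file = Vec::new();
    file.extend_from_slice(&(lists.len() as u32).to_le_bytes());
    let mut offset = header_len as u64;
    for l in lists {
        file.extend_from_slice(&offset.to_le_bytes());
        file.extend_from_slice(&(l.len() as u32).to_le_bytes());
        offset += l.len() as u64;
    }
    for l in lists {
        file.extend_from_slice(l);
    }
    file
}
```

Wrapping the output of `build_index` in a `std::io::Cursor` is enough to try the two readers locally. Against object storage, the same access pattern would amount to roughly `nprobe` ranged reads per query after the table has been loaded.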
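
For readers less familiar with ADC (asymmetric distance computation) with precomputed distance tables, the idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the prototype's code: `build_adc_table` and `adc_distance` are hypothetical names, the codebooks are assumed to be stored flat as `[m][ksub][dsub]`, codes are one byte per sub-quantizer, and the metric is squared L2. With residual encoding, `query` would be the query's residual relative to the probed centroid.

```rust
/// Build the per-query lookup table: for each of the `m` sub-spaces, the squared
/// L2 distance from the query's sub-vector to each of the `ksub` codewords.
/// `codebooks` is flattened as [m][ksub][dsub]; the result is flattened as [m][ksub].
pub fn build_adc_table(query: &[f32], codebooks: &[f32], m: usize, ksub: usize) -> Vec<f32> {
    let dsub = query.len() / m;
    assert_eq!(query.len(), m * dsub, "dimension must be divisible by m");
    assert_eq!(codebooks.len(), m * ksub * dsub, "unexpected codebook size");
    let mut table = vec![0.0f32; m * ksub];
    for sub in 0..m {
        let q = &query[sub * dsub..(sub + 1) * dsub];
        for k in 0..ksub {
            let start = (sub * ksub + k) * dsub;
            let c = &codebooks[start..start + dsub];
            table[sub * ksub + k] = q.iter().zip(c).map(|(a, b)| (a - b) * (a - b)).sum();
        }
    }
    table
}

/// Approximate squared distance to one encoded vector: `m` table lookups and
/// additions, with no access to the original (uncompressed) vector.
pub fn adc_distance(table: &[f32], code: &[u8], ksub: usize) -> f32 {
    code.iter()
        .enumerate()
        .map(|(sub, &c)| table[sub * ksub + c as usize])
        .sum()
}
```

The table is built once per query (or once per probed list when residuals are used) and then reused for every code in the list, which is why scanning an inverted list stays cheap even for large lists.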
