Hi Wang,

Thanks for the excellent questions! These are exactly the right things
to think about early. Let me address each one.

The short answer is: the first version focuses on a per-data-file
local index, but the design intentionally separates the "training
artifacts" (centroids, codebooks, OPQ rotation) from the "index data"
(inverted lists), which is the key prerequisite for distributed
indexing.

## Q1: Will IVF-PQ use globally trained centroids/codebooks?

Yes, this is the intended direction. The architecture naturally
supports a two-phase approach:

**Phase 1 (current):** Training and indexing happen together per data
file. Each file trains its own centroids and PQ codebooks. This is
simple and already works.

**Phase 2 (future):** Separate training from indexing:

```
Global Training (once per table/partition):
  Sample vectors from multiple files → train centroids + PQ codebooks + OPQ rotation
  → Store as a shared "training artifact" file

Distributed Build (per worker):
  Load shared training artifacts
  → Assign vectors to global IVF lists
  → Encode residuals with shared PQ codebooks
  → Write index file (only contains inverted list data, references shared artifacts)
```
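
To make the "assign vectors to global IVF lists" and "encode residuals" steps concrete, here is a minimal Java sketch of what a worker would compute per vector. It is only an illustration, not code from the prototype: the class `IvfAssignSketch` and its method names are hypothetical, centroids are assumed to be plain `float[][]` rows, and squared L2 distance is assumed as the metric.

```java
/** Hypothetical sketch of the per-worker "assign + residual" step; not Paimon API. */
final class IvfAssignSketch {

    /** Returns the index of the centroid with the smallest squared L2 distance to v. */
    static int nearestCentroid(float[][] centroids, float[] v) {
        int best = -1;
        float bestDist = Float.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            float d = 0f;
            for (int i = 0; i < v.length; i++) {
                float diff = v[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Residual that the shared PQ codebooks would encode: v minus its assigned centroid. */
    static float[] residual(float[] centroid, float[] v) {
        float[] r = new float[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[i] - centroid[i];
        }
        return r;
    }
}
```

Because every worker loads the same centroids, the list ID returned by `nearestCentroid` means the same thing in every index file, which is what makes the later metadata-based routing possible.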

The file format already supports this separation — centroids and
codebooks are stored in a well-defined section of the file. In Phase
2, we can either:
- Embed the shared artifacts in each index file (simple, some redundancy)
- Or reference an external artifact file via metadata (no redundancy,
slightly more complex)

The `IVFPQIndexMeta` (stored in `GlobalIndexIOMeta.metadata()`) can be
extended with a `training_artifact_path` field without breaking
backward compatibility.
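
One way the optional field could stay backward compatible is to append it after the existing fixed fields, so that metadata written by older versions simply ends early. The sketch below is an assumption for illustration: the real `IVFPQIndexMeta` layout is not shown in this thread, so the class `IvfPqMetaSketch` and its single fixed field `nlist` are hypothetical stand-ins.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for IVFPQIndexMeta; the real field layout is an assumption. */
final class IvfPqMetaSketch {
    final int nlist;
    /** Null means the training artifacts are embedded in the index file itself. */
    final String trainingArtifactPath;

    IvfPqMetaSketch(int nlist, String trainingArtifactPath) {
        this.nlist = nlist;
        this.trainingArtifactPath = trainingArtifactPath;
    }

    /** Writes the fixed field first, then the optional path as a trailing field. */
    byte[] serialize() {
        if (trainingArtifactPath == null) {
            return ByteBuffer.allocate(4).putInt(nlist).array();
        }
        byte[] path = trainingArtifactPath.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + path.length);
        buf.putInt(nlist);
        buf.putInt(path.length);
        buf.put(path);
        return buf.array();
    }

    /** Reads old metadata (no trailing bytes) and new metadata (with path) alike. */
    static IvfPqMetaSketch deserialize(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int nlist = buf.getInt();
        String path = null;
        if (buf.hasRemaining()) {
            byte[] raw = new byte[buf.getInt()];
            buf.get(raw);
            path = new String(raw, StandardCharsets.UTF_8);
        }
        return new IvfPqMetaSketch(nlist, path);
    }
}
```

With this shape, an index file written in Phase 1 deserializes with a null path and is treated as self-contained, while a Phase 2 file points at the shared artifact.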

## Q2: How will distributed index building work?

Paimon's existing architecture already provides the distributed
skeleton. The key insight is that IVF-PQ's inverted lists are
**partitionable by list ID**:

```
Global IVF lists: [0, 1, 2, ..., nlist-1]

Worker 0 writes: index-file-0.ivfpq → contains lists {0, 3, 7, 12, ...}
Worker 1 writes: index-file-1.ivfpq → contains lists {1, 4, 8, 15, ...}
Worker 2 writes: index-file-2.ivfpq → contains lists {2, 5, 9, 11, ...}
```

Each worker writes an independent index file. **No physical merge is
needed** — Paimon's snapshot metadata already tracks multiple index
files per partition/bucket. The query layer reads the relevant files
based on metadata.

This is analogous to how BTree global index already works: BTree
stores `firstKey`/`lastKey` in `BTreeIndexMeta` to enable file pruning
via `BTreeFileMetaSelector`. For IVF-PQ, we can store the **list ID
set** (which IVF lists this file contains) in `IVFPQIndexMeta`:

```
BTree metadata:    {firstKey, lastKey}     → prune files by key range
IVF-PQ metadata:   {listIds: [3,7,12,...]} → prune files by IVF list assignment
```

The `GlobalIndexIOMeta` system supports this naturally — each
`ResultEntry` already carries a `byte[] meta` field. We just need to
serialize the list ID set there.
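
As a rough idea of how small this can be, the list ID set could be stored as a bitmap. The sketch below uses `java.util.BitSet` purely for illustration; the class `ListIdSetSketch` and its methods are hypothetical, and the actual encoding inside `IVFPQIndexMeta` is still an open choice (a compressed bitmap might be preferable for large `nlist`).

```java
import java.util.BitSet;

/** Hypothetical encoding of "which IVF lists does this index file contain". */
final class ListIdSetSketch {

    /** Encodes the list IDs as a bitmap, one bit per IVF list. */
    static byte[] encode(int[] listIds) {
        BitSet bits = new BitSet();
        for (int id : listIds) {
            bits.set(id);
        }
        return bits.toByteArray();
    }

    /** True if the file described by meta holds at least one of the probed lists. */
    static boolean containsAny(byte[] meta, int[] probedListIds) {
        BitSet bits = BitSet.valueOf(meta);
        for (int id : probedListIds) {
            if (bits.get(id)) {
                return true;
            }
        }
        return false;
    }
}
```

A file whose bitmap shares no bit with the probed lists can be skipped without opening it, which plays the same role as the `firstKey`/`lastKey` range check for BTree.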

Logical merging through metadata is sufficient for reads. Physical
compaction (merging multiple index files into one) can be a background
optimization, similar to how Paimon compacts data files.

## Q3: How will query routing work?

With the metadata described above, the query flow for distributed IVF-PQ is:

```
1. Compute top-nprobe IVF list IDs from query vector (using shared centroids)
2. For each list ID, look up which index files contain that list (via metadata)
3. Read only those files, only the relevant inverted lists within each file
4. Merge results across files using a global top-k heap
```

Step 2 is the "file selector" step, directly analogous to
`BTreeFileMetaSelector`:

```
// BTree: select files whose key range overlaps the query
BTreeFileMetaSelector → visitEqual(key) → filter by firstKey/lastKey

// IVF-PQ: select files whose list IDs overlap the probed lists
IVFPQFileMetaSelector → visitVectorSearch(query) → filter by listIds
```

This means the query only reads the minimum number of files and the
minimum number of inverted lists within each file — two levels of
pruning.
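
Steps 1 and 4 of the flow above can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not the prototype's code: `QueryRoutingSketch`, `Hit`, `topNprobe`, and `mergeTopK` are hypothetical names, distances are squared L2, and each index file is assumed to have already produced its own candidate list of (row ID, approximate distance) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of query routing steps 1 (probe selection) and 4 (merge). */
final class QueryRoutingSketch {

    /** One candidate returned by scanning an inverted list in some index file. */
    static final class Hit {
        final long rowId;
        final float distance;

        Hit(long rowId, float distance) {
            this.rowId = rowId;
            this.distance = distance;
        }
    }

    /** Step 1: IDs of the nprobe centroids closest to the query (squared L2). */
    static int[] topNprobe(float[][] centroids, float[] query, int nprobe) {
        Integer[] ids = new Integer[centroids.length];
        float[] dist = new float[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            ids[c] = c;
            for (int i = 0; i < query.length; i++) {
                float diff = query[i] - centroids[c][i];
                dist[c] += diff * diff;
            }
        }
        Arrays.sort(ids, (a, b) -> Float.compare(dist[a], dist[b]));
        int n = Math.min(nprobe, ids.length);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = ids[i];
        }
        return out;
    }

    /** Step 4: keep the k smallest-distance hits across all per-file result lists. */
    static List<Hit> mergeTopK(List<List<Hit>> perFile, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on distance: the root is the worst hit currently kept.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>((a, b) -> Float.compare(b.distance, a.distance));
        for (List<Hit> hits : perFile) {
            for (Hit h : hits) {
                if (heap.size() < k) {
                    heap.add(h);
                } else if (h.distance < heap.peek().distance) {
                    heap.poll();
                    heap.add(h);
                }
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort((a, b) -> Float.compare(a.distance, b.distance));
        return out;
    }
}
```

The merge is only meaningful because all files share the same centroids and PQ codebooks, so approximate distances computed in different files are directly comparable.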

## Q4: Incremental index build and compaction?

**Incremental build:** When new data files are added, a new index file
is built for the new data using the shared training artifacts (same
centroids, same PQ codebooks). No need to rebuild existing index
files. The new file is registered in the snapshot metadata alongside
existing index files.

**Index compaction:** Multiple small index files can be merged into
larger ones as a background optimization. Since inverted lists are
independent, this is a simple concatenation per list ID — no
retraining needed.
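
The "concatenation per list ID" idea can be illustrated with a toy model in which each index file is a map from list ID to the entries in that list. This is a sketch under simplifying assumptions: `CompactionSketch` is a hypothetical name, entries are modeled as bare row IDs, and a real entry would also carry its PQ code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: compaction as per-list concatenation, no retraining. */
final class CompactionSketch {

    /** Merges several index files by appending the entries of each list ID. */
    static Map<Integer, List<Long>> compact(List<Map<Integer, List<Long>>> files) {
        // TreeMap keeps list IDs ordered, mirroring an offset table sorted by list ID.
        Map<Integer, List<Long>> merged = new TreeMap<>();
        for (Map<Integer, List<Long>> file : files) {
            for (Map.Entry<Integer, List<Long>> e : file.entrySet()) {
                merged.computeIfAbsent(e.getKey(), id -> new ArrayList<>())
                        .addAll(e.getValue());
            }
        }
        return merged;
    }
}
```

This only holds when all input files were built from the same training artifacts; files built against different centroids or codebooks would have to be re-encoded instead of concatenated.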

**Re-training:** When data distribution drifts significantly, the
shared training artifacts can be retrained from fresh samples and all
index files rebuilt. This is a heavy operation but infrequent (similar
to how partition statistics are refreshed).

## Summary

The critical point: the file format and metadata abstraction in V1
should accommodate the distributed case. The centroids/codebooks
section is clearly separated from inverted list data, and
`IVFPQIndexMeta` is extensible. We don't need to redesign the format
when adding distributed support — we just need to add the
orchestration layer (global training coordinator, file selector,
multi-file merger).

Best,
Jingsong Li

On Thu, Jun 4, 2026 at 4:00 PM wang <[email protected]> wrote:
>
> Hi Jingsong,
>
> I'm really excited about this feature! A native IVF-PQ implementation
> sounds very valuable for Paimon, especially if we can make it lake-native
> and avoid external native dependencies.
>
> One question I have is whether the current proposal is mainly focused on
> introducing a local vector index library, or whether it is also intended to
> lay the groundwork for a more distributed vector index framework in Paimon,
> similar to Lance's approach.
>
> If we want to support large-scale distributed vector indexes in the future,
> I think there are several design questions that may affect the index file
> format, metadata, and build/query workflow:
>
>    1. Will IVF-PQ use globally trained OPQ rotation, coarse centroids, and
>    PQ codebooks? For example, in Lance, global IVF centroids are trained from
>    global samples before workers build the actual index files, and PQ
>    codebooks are also generated before distributed index writing.
>    2. How will distributed index building work? Will each worker write
>    independent index files or index segments, and then commit them through a
>    global index manifest/catalog? Do we need to merge these files physically,
>    or is logical merging through metadata enough?
>    3. How will query routing work in a distributed setup? For example,
>    after finding the top nprobe global IVF lists for a query, how do we map
>    these list IDs to physical index files or workers?
>    4. Do we plan to support incremental index build and index compaction?
>
> I don't think all of these need to be solved in the first version, but it
> would be great if the proposal could clarify the long-term direction. My
> main concern is that these distributed/snapshot-aware requirements may
> influence the index library abstraction and file format from the beginning.
>
> On Thu, Jun 4, 2026 at 3:05 PM Jingsong Li <[email protected]> wrote:
>
> > Hi devs,
> >
> > I'd like to propose adding a native IVF-PQ (Inverted File with Product
> > Quantization) vector index to Paimon. I have a working prototype and
> > would like to get community feedback on the direction before opening a
> > formal proposal.
> >
> > ## Motivation
> >
> > Paimon currently supports vector search via Lumina (DiskANN), which
> > depends on an external native library (`lumina-jni` from Alibaba
> > Cloud). While DiskANN is a great algorithm, the external dependency
> > brings several challenges:
> >
> > 1. **Availability**: The library is hosted on a private Maven
> > repository, not on Maven Central. This creates a supply chain risk for
> > the open-source project.
> > 2. **Single algorithm**: DiskANN is graph-based and has its strengths,
> > but IVF-PQ is the industry standard for large-scale ANN with good
> > accuracy-speed-memory trade-offs, especially when combined with OPQ.
> >
> > ## Why not just use Faiss?
> >
> > Faiss is the gold standard for IVF-PQ, but it does not fit Paimon's
> > architecture:
> > 1. **No InputStream abstraction**: Faiss's `read_index` /
> > `write_index` operates on local files or memory buffers. There is no
> > way to plug in a custom I/O layer. In Paimon, index files live on
> > object storage (S3/HDFS/OSS) and are accessed via
> > `SeekableInputStream`. To use Faiss, we would have to download the
> > entire index file to local disk first — defeating the purpose of a
> > data lake architecture.
> > 2. **JNI callback is impractical**: Even wrapping Faiss via JNI,
> > Faiss's internal I/O uses `fread`/`fseek` on `FILE*` pointers.
> > Redirecting these to Java's `SeekableInputStream` would require
> > intercepting every low-level C read call, which is fragile and has
> > high per-call overhead. Our Rust implementation uses a `SeekRead`
> > trait that maps cleanly to JNI callbacks at the inverted-list
> > granularity (a few bulk reads per query, not thousands of small
> > reads).
> > 3. **IVF-PQ only needs the algorithm, not the framework**: Faiss is a
> > large C++ framework (~200K lines) with GPU support, multi-index
> > quantizers, polysemous codes, etc. Paimon only needs the core IVF-PQ
> > search path. A focused Rust implementation covers this in ~3K lines,
> > is easier to maintain, and has zero system library requirements (no
> > BLAS, LAPACK, or OpenMP).
> > 4. **Cross-language consistency**: Faiss has separate C++ and Python
> > interfaces but no Java interface. With Rust, one codebase produces
> > both JNI (for Paimon Java engine) and PyO3 (for PyPaimon) bindings,
> > guaranteeing identical behavior and file format compatibility.
> >
> > Our implementation aligns with Faiss's algorithmic details (ADC,
> > precomputed tables, OPQ, k-means++ empty cluster handling) so the
> > search quality is equivalent — we just own the I/O layer.
> >
> > ## Proposal
> >
> > Build a **pure Rust IVF-PQ** implementation inside the Paimon
> > repository, with JNI bindings for Java and PyO3 bindings for Python.
> > Key design goals:
> >
> > - **SeekableInputStream native**: The file format is designed for
> > remote storage — offset table precedes inverted list data, so queries
> > only read `nprobe` lists via seek+read, not the entire file.
> > - **Faiss-aligned algorithms**: Same ADC with precomputed distance
> > tables, residual encoding, k-means++ with Faiss-style empty cluster
> > handling, OPQ rotation via Procrustes+SVD.
> > - **Zero external dependency**: The core library uses only pure Rust
> > crates (`matrixmultiply`, `nalgebra`, `rayon`). No system BLAS,
> > LAPACK, or OpenMP required.
> > - **One codebase, two languages**: A single Rust core serves both the
> > Java engine (via JNI) and PyPaimon (via PyO3), ensuring consistent
> > behavior.
> >
> > We need to create a new `paimon-vector-index` sub-project repository.
> >
> > What do you think?
> >
> > Best,
> > Jingsong
> >
