JingsongLi opened a new pull request, #8174: URL: https://github.com/apache/paimon/pull/8174
## Summary

- Integrate `apache/paimon-vector-index` (pure Rust IVF-PQ) into Paimon's GlobalIndex SPI framework
- Two-level module structure following `paimon-tantivy` pattern: `paimon-ivfpq-jni` (Java JNI bindings + NativeLoader) and `paimon-ivfpq-index` (Paimon GlobalIndexer integration)
- Register new `GlobalIndexerFactory` with identifier `"ivfpq"` via SPI

## Key Design Decisions

- **No stream adapter classes**: IVF-PQ JNI calls `seek()`/`read()`/`write()` by method name on any Java object, so `SeekableInputStream` and `PositionOutputStream` are directly compatible
- **Native Roaring bitmap filter pushdown**: pass serialized `byte[]` directly to Rust instead of materializing `long[]` of all IDs (much more efficient for large cardinalities)
- **Reservoir sampling for training**: configurable `ivfpq.train.sample_ratio` to control memory usage during IVF-PQ codebook training
- **Batched vector insertion**: configurable `ivfpq.add.batch_size` for memory-efficient index building

## Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| `ivfpq.index.dimension` | 128 | Vector dimension |
| `ivfpq.distance.metric` | inner_product | Distance metric (l2, cosine, inner_product) |
| `ivfpq.nlist` | 256 | Number of IVF partitions |
| `ivfpq.m` | 16 | Number of PQ sub-quantizers (must divide dimension) |
| `ivfpq.use_opq` | false | Use Optimized Product Quantization |
| `ivfpq.nprobe` | 16 | Partitions to probe at search time |
| `ivfpq.train.sample_ratio` | 1.0 | Training sample ratio |
| `ivfpq.add.batch_size` | 10000 | Batch size for addVectors |

## Test plan

- [ ] Build native library from `apache/paimon-vector-index` and copy to resources
- [ ] Verify `mvn compile` passes (confirmed locally)
- [ ] Run end-to-end vector search tests with different metrics and dimensions
- [ ] Verify Roaring bitmap filter pushdown works correctly
- [ ] CI workflow (`utcase-ivfpq.yml`) builds and tests automatically

🤖 Generated with [Claude 
Code](https://claude.com/claude-code)
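
The reservoir-sampling decision listed under Key Design Decisions could be sketched roughly as follows. This is a minimal illustration of the classic Algorithm R in Java, not the code in this PR: the class name `ReservoirSampleSketch`, the `sample` method, and the fixed `capacity` parameter are all hypothetical, and how `ivfpq.train.sample_ratio` is translated into a reservoir size is deliberately left out because the PR description does not say.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of Algorithm R; not the implementation in this PR. */
final class ReservoirSampleSketch {

    /**
     * Keeps a uniform random sample of at most {@code capacity} vectors from a
     * stream of unknown length, holding only the reservoir in memory.
     */
    static List<float[]> sample(Iterator<float[]> vectors, int capacity, Random rng) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be >= 0");
        }
        List<float[]> reservoir = new ArrayList<>(capacity);
        long seen = 0;
        while (vectors.hasNext()) {
            float[] v = vectors.next();
            seen++;
            if (reservoir.size() < capacity) {
                // Fill phase: the first `capacity` vectors are always kept.
                reservoir.add(v);
            } else {
                // Replace phase: draw j in [0, seen); the new vector is kept
                // with probability capacity / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < capacity) {
                    reservoir.set((int) j, v);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<float[]> all = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            all.add(new float[] {i});
        }
        List<float[]> picked = sample(all.iterator(), 10, new Random(42));
        System.out.println("sampled " + picked.size() + " of " + all.size() + " vectors");
    }
}
```

The point of the technique is that training memory is bounded by the reservoir size rather than by the number of rows scanned, which matches the stated goal of controlling memory usage during codebook training.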

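
The constraint in the Configuration Options table that `ivfpq.m` must divide the vector dimension can be checked up front, before any native call. Below is a hedged sketch of such a check using a plain `Map<String, String>`; the class `IvfPqOptionsSketch` and its `fromMap` method are hypothetical and do not reflect Paimon's actual options classes. Only the option keys and defaults come from the table, and the positivity checks are an added assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical holder for a few of the listed options; not Paimon's API. */
final class IvfPqOptionsSketch {
    final int dimension;
    final int nlist;
    final int m;
    final int nprobe;

    private IvfPqOptionsSketch(int dimension, int nlist, int m, int nprobe) {
        this.dimension = dimension;
        this.nlist = nlist;
        this.m = m;
        this.nprobe = nprobe;
    }

    /** Reads options with the defaults from the table and validates them. */
    static IvfPqOptionsSketch fromMap(Map<String, String> conf) {
        int dimension = intOption(conf, "ivfpq.index.dimension", 128);
        int nlist = intOption(conf, "ivfpq.nlist", 256);
        int m = intOption(conf, "ivfpq.m", 16);
        int nprobe = intOption(conf, "ivfpq.nprobe", 16);
        // Assumption: all four values must be positive.
        if (dimension <= 0 || nlist <= 0 || m <= 0 || nprobe <= 0) {
            throw new IllegalArgumentException("IVF-PQ options must be positive");
        }
        // From the table: the number of PQ sub-quantizers must divide the dimension.
        if (dimension % m != 0) {
            throw new IllegalArgumentException(
                    "ivfpq.m (" + m + ") must divide ivfpq.index.dimension (" + dimension + ")");
        }
        return new IvfPqOptionsSketch(dimension, nlist, m, nprobe);
    }

    private static int intOption(Map<String, String> conf, String key, int defaultValue) {
        String raw = conf.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("ivfpq.index.dimension", "768");
        conf.put("ivfpq.m", "32");
        IvfPqOptionsSketch opts = fromMap(conf);
        System.out.println("dimension=" + opts.dimension + ", m=" + opts.m);
    }
}
```

Failing early in Java gives a readable error message instead of a failure surfacing from the native library during training.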