xiangfu0 opened a new pull request, #18091: URL: https://github.com/apache/pinot/pull/18091
## Summary

- Add **IVF_PQ** (Inverted File with Product Quantization) as a second vector index backend alongside HNSW
- Pure Java implementation for immutable/offline segments with configurable recall/latency tradeoffs
- Runtime query options (`vectorNprobe`, `vectorExactRerank`) for per-query tuning without SQL changes
- Correct distance function support for COSINE, EUCLIDEAN, and INNER_PRODUCT

## Design

### Core Implementation

- KMeans++ clustering for IVF coarse centroids and PQ sub-quantizer training
- Residual-based Product Quantization with asymmetric distance computation (ADC)
- Binary index format (`.vector.ivfpq.index`) with versioned header, coarse centroids, PQ codebooks, and inverted lists that keep the original vectors for exact rerank
- COSINE vectors are normalized at build and search time so that L2-based PQ scoring produces the correct cosine ranking

### Runtime Integration

- `vectorNprobe` query option overrides the configured nprobe at query time
- `vectorExactRerank` query option (default `true` for IVF_PQ): 4x candidate oversampling, then exact distance rerank
- Explain/debug output shows backend, nprobe, rerank status, and distance function
- IVF_PQ returns `null` for mutable segments (falls back to exact scan)

### Lifecycle

- V1→V3 converter copies IVF_PQ files
- `VectorIndexUtils.hasVectorIndex()` detects IVF_PQ
- `VectorIndexUtils.cleanupVectorIndex()` deletes IVF_PQ files
- `VectorIndexHandler` detects HNSW↔IVF_PQ backend mismatch and triggers rebuild
- V3 subdirectory detection for backend migration

### Benchmark Results (10K vectors, 64-dim)

| Config | Recall@10 | p50 (us) | Index Size |
|--------|-----------|----------|------------|
| ExactScan | 1.00 | 679 | 0 |
| IVF_PQ(nl=128,m=16,np=16,rerank) | 0.51 | 253 | 2.73 MB |
| IVF_PQ(nl=256,m=16,np=32,rerank) | 0.62 | 483 | 2.76 MB |

Exact rerank improves Recall@10 by ~20% with <5% latency overhead.
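
As a rough illustration of the oversample-then-rerank step listed under Runtime Integration, the Java sketch below picks `4 * k` candidates by an approximate distance and then reorders them by exact distance against the stored original vectors. It is not the code from this PR: the class `RerankSketch`, the method `topKWithRerank`, and the flat `approxDistances` array are hypothetical names chosen for the example, and squared L2 stands in for whichever distance function is configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "4x candidate oversampling then exact rerank".
// None of these names come from the PR.
class RerankSketch {
  // Oversampling factor as stated in the PR description.
  static final int OVERSAMPLE_FACTOR = 4;

  // Exact squared Euclidean distance between two vectors of equal length.
  static double squaredL2(float[] a, float[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  // approxDistances[i] is assumed to be an approximate (e.g. PQ-based)
  // distance from the query to vectors[i]. Returns up to k doc ids,
  // ordered by exact distance.
  static int[] topKWithRerank(float[] query, float[][] vectors,
      float[] approxDistances, int k) {
    int numCandidates = Math.min(vectors.length, k * OVERSAMPLE_FACTOR);

    // Step 1: rank all ids by approximate distance and keep the best 4k.
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < vectors.length; i++) {
      ids.add(i);
    }
    ids.sort(Comparator.comparingDouble((Integer id) -> approxDistances[id]));
    List<Integer> candidates = new ArrayList<>(ids.subList(0, numCandidates));

    // Step 2: rerank the candidates by exact distance to the original vectors.
    candidates.sort(
        Comparator.comparingDouble((Integer id) -> squaredL2(query, vectors[id])));

    // Step 3: return the best k after reranking.
    int[] result = new int[Math.min(k, candidates.size())];
    for (int i = 0; i < result.length; i++) {
      result[i] = candidates.get(i);
    }
    return result;
  }

  public static void main(String[] args) {
    float[] query = {0f, 0f};
    float[][] vectors = {{1f, 0f}, {0.1f, 0f}, {3f, 0f}, {0.5f, 0f}, {2f, 0f}};
    // Deliberately noisy approximate distances: id 0 looks closest.
    float[] approx = {0.2f, 0.3f, 9f, 0.25f, 4f};
    System.out.println(Arrays.toString(topKWithRerank(query, vectors, approx, 1)));
  }
}
```

In this toy input the approximate ranking alone would return id 0, while the rerank over the four oversampled candidates returns id 1, which is the true nearest neighbour. Reranking can only recover neighbours that make it into the candidate set, which is why the candidate list is larger than `k`.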
## Test plan

- [x] KMeansTest — 5 tests (clustering, findNearest, edge cases)
- [x] ProductQuantizerTest — 5 tests (train/encode, distance tables, order preservation)
- [x] IvfPqVectorIndexTest — 4 tests (create+search round-trip, recall, empty index, small dataset)
- [x] IvfPqRuntimeTest — 8 tests (nprobe override, rerank on/off, COSINE distance, INNER_PRODUCT, debug info)
- [x] VectorIndexTest — existing HNSW backward compat ✓
- [x] VectorConfigTest — existing config serde ✓
- [x] Benchmark suite (disabled by default, run manually)
- [x] Pre-commit: spotless, checkstyle, license all pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
