JingsongLi opened a new pull request, #24: URL: https://github.com/apache/paimon-vector-index/pull/24
## Summary

Improve IVF-HNSW build performance by parallelizing graph construction, reusing HNSW build/search workspaces, and adding SIMD L2 distance kernels. The recall benchmark now separates exact-scan baselines from HNSW build/search rows so HNSW build cost is easier to compare.

## Changes

- Add an owned HNSW build path and a large-list parallel HNSW builder with precomputed levels and a LanceDB-style fixed high-level entry point.
- Reuse build/search workspaces to reduce repeated allocation in HNSW insertion and probing.
- Build IVF-HNSW-FLAT and IVF-HNSW-SQ graphs across IVF lists in parallel; avoid an extra decoded-vector copy in the SQ graph path.
- Add NEON/AVX2 implementations for squared L2 distance, used by HNSW construction and scan/search paths.
- Rework `recall_bench` output to include exact representation baselines and a separate HNSW section.

## Testing

- `cargo fmt --check`
- `cargo test -p paimon-vindex-core`
- `cargo test -p paimon-vindex-core hnsw -- --nocapture`
- `cargo bench -p paimon-vindex-core --bench recall_bench -- --nocapture`

## Benchmark Notes

On the local arm64 benchmark scenario, large-list IVF-HNSW build time improved from roughly 3.2s initially to about 0.75s after these changes, while HNSW-FLAT recall remained around 99.4%-100% and HNSW-SQ matched the SQ baseline (~88.2%).
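For readers unfamiliar with the NEON/AVX2 squared-L2 kernels mentioned above, here is a minimal sketch of what such a kernel can look like with `std::arch` intrinsics. It is an illustration under stated assumptions, not the code in this PR: the function names (`l2_sq`, `l2_sq_scalar`, `l2_sq_avx2`, `l2_sq_neon`) are hypothetical, and the actual kernels may unroll, accumulate, or dispatch differently.

```rust
/// Scalar reference: sum of squared element-wise differences.
pub fn l2_sq_scalar(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// AVX2 path (hypothetical name): 8 lanes per iteration, scalar tail.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_sq_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let chunks = a.len() / 8;
    // SAFETY: the caller guarantees AVX2 support and a.len() == b.len();
    // every load reads 8 floats starting at index i * 8 < chunks * 8.
    let mut sum = unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            let d = _mm256_sub_ps(va, vb);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(d, d));
        }
        // Horizontal sum: spill the 8 lanes and add them up.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        lanes.iter().sum::<f32>()
    };
    for i in chunks * 8..a.len() {
        let d = a[i] - b[i];
        sum += d * d;
    }
    sum
}

/// NEON path (hypothetical name): 4 lanes per iteration, scalar tail.
#[cfg(target_arch = "aarch64")]
unsafe fn l2_sq_neon(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::aarch64::*;
    let chunks = a.len() / 4;
    // SAFETY: the caller guarantees a.len() == b.len();
    // every load reads 4 floats starting at index i * 4 < chunks * 4.
    let mut sum = unsafe {
        let mut acc = vdupq_n_f32(0.0);
        for i in 0..chunks {
            let va = vld1q_f32(a.as_ptr().add(i * 4));
            let vb = vld1q_f32(b.as_ptr().add(i * 4));
            let d = vsubq_f32(va, vb);
            acc = vfmaq_f32(acc, d, d); // acc + d * d
        }
        vaddvq_f32(acc)
    };
    for i in chunks * 4..a.len() {
        let d = a[i] - b[i];
        sum += d * d;
    }
    sum
}

/// Dispatch: runtime AVX2 detection on x86_64, scalar otherwise.
#[cfg(target_arch = "x86_64")]
pub fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    if is_x86_feature_detected!("avx2") {
        unsafe { l2_sq_avx2(a, b) }
    } else {
        l2_sq_scalar(a, b)
    }
}

/// Dispatch: NEON is part of the baseline on aarch64.
#[cfg(target_arch = "aarch64")]
pub fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    unsafe { l2_sq_neon(a, b) }
}

/// Dispatch: portable fallback for every other architecture.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    l2_sq_scalar(a, b)
}
```

SIMD accumulation sums lanes in a different order than the scalar loop, so results can differ in the last few bits; comparing the two paths with a small tolerance is usually more robust than exact equality.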

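The combination of per-list parallelism and workspace reuse described in the changes can be sketched as follows. This is a simplified illustration, not the builder from this PR: `Workspace`, `Graph`, `build_list`, and `build_all` are hypothetical names, and a brute-force nearest-neighbor pass stands in for real HNSW insertion so the example stays short. What it does show is the structure: each worker thread owns one scratch workspace and reuses it for every IVF list it builds.

```rust
use std::thread;

/// Hypothetical scratch buffer owned by one worker and reused across lists.
#[derive(Default)]
pub struct Workspace {
    candidates: Vec<(f32, usize)>,
}

/// Placeholder graph: for each point, the ids of its `m` nearest neighbors
/// within the same list. A real HNSW graph has multiple layers.
pub type Graph = Vec<Vec<usize>>;

fn dist_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Builds the placeholder graph for one list, reusing `ws` for candidates.
fn build_list(points: &[Vec<f32>], m: usize, ws: &mut Workspace) -> Graph {
    let mut graph = Vec::with_capacity(points.len());
    for (i, p) in points.iter().enumerate() {
        // `clear` keeps the allocated capacity, so later points and later
        // lists do not reallocate unless they need more room.
        ws.candidates.clear();
        for (j, q) in points.iter().enumerate() {
            if i != j {
                ws.candidates.push((dist_sq(p, q), j));
            }
        }
        ws.candidates.sort_by(|x, y| x.0.total_cmp(&y.0));
        graph.push(ws.candidates.iter().take(m).map(|&(_, j)| j).collect());
    }
    graph
}

/// Builds one graph per list, splitting the lists across `threads` workers.
/// Output order matches input order.
pub fn build_all(lists: &[Vec<Vec<f32>>], m: usize, threads: usize) -> Vec<Graph> {
    let threads = threads.max(1);
    let per_worker = ((lists.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = lists
            .chunks(per_worker)
            .map(|chunk| {
                s.spawn(move || {
                    let mut ws = Workspace::default(); // one per worker
                    chunk
                        .iter()
                        .map(|list| build_list(list, m, &mut ws))
                        .collect::<Vec<Graph>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    })
}
```

Static chunking like this is the simplest split; when list sizes are very uneven, a work-stealing pool or a shared queue of lists would balance load better, at the cost of extra dependencies or synchronization.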