GitHub user chrevanthreddy added a comment to the discussion: RFC-104 - Add native Vector Search Index type for Hudi Vector Type
Benchmarking tests on a 1M dataset:

The ivf-exact path does:

  1. batch_queries crossJoin broadcast(ivf_centroids)
  2. rank centroids per query by squared L2 distance
  3. keep the nprobe nearest centroids
  4. join to ivf_vectors by cluster_id
  5. compute exact cosine distance
  6. rank per query by exact distance
  7. keep top-20

Observed 10k-query no-write result:

  IVF-only probe rows:                        40000
  IVF-only probe stage:                       29.480s
  IVF-only materialize stage:                 2042.473s
  IVF-only result rows:                       200000
  IVF-only total runtime (no GCS write):      2071.955s

The ivf-rabitq path does:

  1. batch_queries crossJoin broadcast(ivf_centroids)
  2. rank centroids per query by squared L2 distance
  3. keep the nprobe nearest centroids
  4. join to ivf_postings by cluster_id
  5. compute Hamming distance
  6. keep top 1000 candidates per query
  7. join surviving candidates to ivf_vectors
  8. compute exact cosine distance
  9. rank per query by exact distance
  10. keep top-20

Observed 10k-query no-write result:

  IVF probe rows:                             40000
  IVF candidate rows (top-1000 per query):    10000000
  Probe stage:                                133.598s
  Hamming stage:                              175.685s
  IVF + RaBitQ materialize stage:             178.305s
  IVF + RaBitQ result rows:                   200000
  IVF + RaBitQ total runtime (no GCS write):  487.591s

GitHub link: https://github.com/apache/hudi/discussions/18500#discussioncomment-16909264
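
For anyone who wants to see the shape of the two paths without a Spark cluster, here is a minimal single-process sketch in plain Python. It is only an illustration under stated assumptions, not the benchmarked implementation: every function name is hypothetical, the toy index picks the first few vectors as centroids instead of running k-means, and plain sign-bit codes stand in for RaBitQ (the real scheme is more involved; the sketch keeps only the "Hamming filter, then exact rerank" idea). The toy run uses nprobe = 4, which is what 40000 probe rows over 10k queries would imply if there is one probe row per (query, cluster) pair. On the reported totals, 2071.955s versus 487.591s is roughly a 4.25x difference.

```python
import math
import random


def sq_l2(a, b):
    """Squared L2 distance, used only to rank centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """Exact cosine distance (1 - cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def sign_code(v):
    """Pack one sign bit per dimension into an int (simplified stand-in for RaBitQ)."""
    code = 0
    for x in v:
        code = (code << 1) | (1 if x >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def probe(query, centroids, nprobe):
    """Rank centroids by squared L2 and keep the nprobe nearest cluster ids."""
    ranked = sorted(range(len(centroids)), key=lambda c: sq_l2(query, centroids[c]))
    return ranked[:nprobe]


def ivf_exact(query, centroids, clusters, vectors, nprobe, k):
    """Probe, then score every vector in the probed clusters exactly."""
    ids = [i for c in probe(query, centroids, nprobe) for i in clusters[c]]
    return sorted(ids, key=lambda i: cosine_distance(query, vectors[i]))[:k]


def ivf_binary(query, centroids, clusters, vectors, codes, nprobe, n_candidates, k):
    """Probe, filter by Hamming distance on codes, then rerank survivors exactly."""
    ids = [i for c in probe(query, centroids, nprobe) for i in clusters[c]]
    qcode = sign_code(query)
    survivors = sorted(ids, key=lambda i: hamming(qcode, codes[i]))[:n_candidates]
    return sorted(survivors, key=lambda i: cosine_distance(query, vectors[i]))[:k]


def build_toy_index(n, dim, n_clusters, seed=0):
    """Toy index: random Gaussian vectors, first n_clusters vectors act as centroids."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    centroids = [list(v) for v in vectors[:n_clusters]]
    clusters = {c: [] for c in range(n_clusters)}
    for i, v in enumerate(vectors):
        clusters[probe(v, centroids, 1)[0]].append(i)
    codes = [sign_code(v) for v in vectors]
    return centroids, clusters, vectors, codes


if __name__ == "__main__":
    centroids, clusters, vectors, codes = build_toy_index(n=500, dim=32, n_clusters=16)
    query = [random.Random(42).gauss(0.0, 1.0) for _ in range(32)]
    print(ivf_exact(query, centroids, clusters, vectors, nprobe=4, k=20))
    print(ivf_binary(query, centroids, clusters, vectors, codes,
                     nprobe=4, n_candidates=100, k=20))
```

The only difference between the two functions is the Hamming filter: the exact path computes cosine distance for every vector in the probed clusters, while the binary path computes it only for the surviving candidates. In this sketch, if n_candidates covers every probed vector, both paths return the same top-k.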
