JingsongLi opened a new pull request, #19: URL: https://github.com/apache/paimon-vector-index/pull/19
## Summary This PR adds an in-memory IVF_HNSW_FLAT prototype on top of the merged IVF_FLAT baseline. It keeps raw vectors in IVF lists and builds a local HNSW graph per list to speed up large-list searches while preserving FLAT ranking semantics for returned candidates. ## Changes - Add a deterministic HNSW graph implementation with greedy upper-layer search, bottom-layer beam search, bounded neighbor degree, and diversified neighbor selection. - Add , composing storage with per-list instances. - Preserve exact-scan fallback for missing graphs and selective row-id filters. - Extend to compare IVF-PQ, IVF-FLAT, and IVF-HNSW-FLAT across and settings. ## Benchmark Notes On the included synthetic benchmark, the large-list scenario improved from the earlier HNSW prototype's recall plateau to matching IVF-FLAT recall: - , n=50k, nlist=8: IVF-HNSW-FLAT reached 100% recall@10 for tested values. - In the same scenario, IVF-HNSW-FLAT is faster than IVF-FLAT for lower values on large lists. - On small lists, IVF-HNSW-FLAT matches IVF-FLAT recall but is slower, as expected from graph traversal overhead. ## Testing - - running 92 tests test distance::tests::test_fvec_distance_by_metric ... ok test blas::tests::test_sgemm_rectangular ... ok test blas::tests::test_sgemm_a_bt ... ok test distance::tests::test_normalize ... ok test distance::tests::test_l2sqr ... ok test distance::tests::test_inner_product ... ok test distance::tests::test_pq_distance_scalar ... ok test hnsw::tests::test_hnsw_empty_graph_returns_no_results ... ok test fastscan::tests::test_pack_unpack_roundtrip ... ok test fastscan::tests::test_fastscan_correctness ... ok test io::tests::test_delta_ids_wraparound_returns_error ... ok test io::tests::test_d_not_equal_m_times_dsub_returns_error ... ok test io::tests::test_huge_opq_offset_returns_error ... ok test io::tests::test_delta_varint_ids_roundtrip ... ok test io::tests::test_corrupt_delta_ids_returns_error ... ok test io::tests::test_huge_pq_section_size_returns_error ... ok test io::tests::test_large_gap_ids_roundtrip ... ok test io::tests::test_negative_header_d_returns_error ... ok test io::tests::test_negative_header_nlist_returns_error ... ok test io::tests::test_negative_id_bytes_len_returns_error ... ok test io::tests::test_negative_list_count_returns_error ... ok test io::tests::test_space_savings ... ignored test fastscan::tests::test_fastscan_large ... ok test io::tests::test_unsupported_ksub_returns_error ... ok test io::tests::test_varint_above_u64_max_returns_error ... ok test io::tests::test_varint_roundtrip ... ok test ivfflat::tests::test_ivfflat_search_with_filter ... ok test ivfflat::tests::test_ivfflat_add_assigns_all_vectors ... ok test ivfflat::tests::test_ivfflat_recalls_query_vector ... ok test ivfflat_io::tests::test_ivfflat_batch_reader_matches_single_reader_search ... ok test ivfflat_io::tests::test_ivfflat_batch_reader_search_with_roaring_filter_bytes ... ok test ivfflat_io::tests::test_ivfflat_reader_rejects_bad_magic ... ok test ivfflat_io::tests::test_ivfflat_reader_search_with_filter ... ok test ivfflat_io::tests::test_ivfflat_reader_search_with_roaring_filter_bytes ... ok test ivfflat_io::tests::test_ivfflat_write_read_search_roundtrip ... ok test hnsw::tests::test_hnsw_recalls_query_vector_on_single_partition ... ok test ivfhnswflat::tests::test_ivfhnswflat_selective_filter_uses_exact_results ... ok test ivfflat_io::tests::test_ivfflat_reader_validates_inputs ... ok test ivfflat_io::tests::test_ivfflat_writer_validates_shape_before_writing ... ok test ivfhnswflat::tests::test_ivfhnswflat_without_built_graphs_falls_back_to_flat_scan ... ok test ivfhnswflat::tests::test_ivfhnswflat_recalls_query_vector ... ok test hnsw::tests::test_hnsw_respects_neighbor_degree_bound ... ok test io::tests::test_read_inverted_list_uses_pread_after_metadata_loaded ... ok test io::tests::test_write_read_roundtrip_delta_ids ... ok test io::tests::test_write_read_4bit ... ok test ivfpq::tests::test_build_and_search_l2 ... ok test ivfpq::tests::test_big_batch_search ... ok test ivfpq::tests::test_batch_reader_empty_roaring_filter_returns_empty_results ... ok test ivfpq::tests::test_max_codes_early_termination ... ok test ivfpq::tests::test_batch_reader_validates_inputs ... ok test ivfpq::tests::test_merge_rejects_incompatible_training_state ... ok test ivfpq::tests::test_fastscan_invalidated_after_add ... ok test ivfpq::tests::test_from_trained_and_merge ... ok test ivfpq::tests::test_precomputed_table_invalidated_after_add ... ok test ivfpq::tests::test_precomputed_table_matches_normal_search ... ok test ivfpq::tests::test_reader_search_rejects_invalid_roaring_filter_bytes ... ok test ivfpq::tests::test_reader_search_validates_inputs ... ok test ivfpq::tests::test_build_and_search_ip ... ok test ivfpq::tests::test_batch_reader_matches_single_reader_search ... ok test hnsw::tests::test_hnsw_large_partition_recall_tracks_exact_search ... ok test ivfpq::tests::test_write_read_search ... ok test ivfpq::tests::test_reader_search_with_roaring_filter_bytes ... ok test kmeans::tests::test_find_topk ... ok test ivfpq::tests::test_reader_search_works_without_concurrent_pread ... ok test ivfpq::tests::test_4bit_ivfpq ... ok test ivfpq::tests::test_write_read_search_with_filter ... ok test kmeans::tests::test_two_clusters ... ok test opq::tests::test_apply_reverse ... ok test kmeans::tests::test_hot_start_converges_faster ... ok test ivfpq::tests::test_batch_search ... ok test pq::tests::test_4bit_batch_encode ... ok test ivfpq::tests::test_search_with_filter ... ok test kmeans::tests::test_hierarchical_kmeans ... ok test pq::tests::test_distance_table ... ok test pq::tests::test_encode_decode_roundtrip ... ok test shuffler::tests::test_new_does_not_open_one_file_per_partition ... ok test shuffler::tests::test_read_partition_requires_finish_write ... ok test shuffler::tests::test_read_partition_validates_partition_id ... ok test shuffler::tests::test_read_partition_uses_partition_index ... ok test shuffler::tests::test_write_vector_validates_dim ... ok test shuffler::tests::test_write_vector_validates_partition_id ... ok test shuffler::tests::test_shuffler_roundtrip ... ok test kmeans::tests::test_streaming_coreset_kmeans ... ok test shuffler::tests::test_write_vector_limits_open_partition_writers ... ok test pq::tests::test_4bit_encode_decode ... ok test io::tests::test_read_inverted_list_falls_back_for_old_delta_offset_table ... ok test ivfpq::tests::test_batch_reader_search_with_roaring_filter_bytes ... ok test opq::tests::test_rotation_orthogonality ... ok test ivfpq::tests::test_opq_4bit ... ok test ivfpq::tests::test_merge_rejects_incompatible_opq_rotation ... ok test ivfpq::tests::test_opq_ip ... ok test ivfpq::tests::test_opq_cosine ... ok test result: ok. 91 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out; finished in 4.83s running 0 tests test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s - - - running 8 tests test tests::python_flat_writer_can_build_an_index_for_reader ... ok test tests::python_writer_rejects_short_writes ... ok test tests::python_flat_reader_accepts_roaring_filter_bytes ... ok test tests::python_flat_batch_search_accepts_roaring_filter_bytes ... ok test tests::python_batch_search_returns_2d_numpy_arrays ... ok test tests::python_writer_can_build_an_index_for_reader ... ok test tests::python_batch_search_validates_query_shape ... ok test tests::python_batch_search_accepts_roaring_filter_bytes ... ok test result: ok. 8 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.67s - - === IVF Recall Attribution Benchmark === scenario: small-lists, n=20000, nq=50, d=64, nlist=64, avg_list=312, k=10, metric=L2 ground truth: 0.03s build IVF-PQ: 0.30s build IVF-FLAT: 0.09s build IVF-HNSW-FLAT: 0.52s index nprobe ef recall@10 query_ms us/query --------- ------ ------ --------- -------- -------- IVF-PQ 1 - 41.80% 0.24 4.7 IVF-FLAT 1 - 77.80% 0.66 13.2 IVF-HNSW 1 80 77.80% 1.83 36.6 IVF-PQ 4 - 49.20% 0.29 5.9 IVF-FLAT 4 - 99.80% 1.18 23.7 IVF-HNSW 4 80 99.80% 4.29 85.7 IVF-PQ 8 - 49.20% 0.30 6.1 IVF-FLAT 8 - 100.00% 2.14 42.7 IVF-HNSW 8 80 100.00% 8.89 177.9 IVF-PQ 16 - 49.20% 0.38 7.6 IVF-FLAT 16 - 100.00% 3.96 79.1 IVF-HNSW 16 80 100.00% 16.84 336.8 IVF-PQ 32 - 49.20% 0.53 10.6 IVF-FLAT 32 - 100.00% 7.56 151.3 IVF-HNSW 32 80 100.00% 32.28 645.7 IVF-PQ 64 - 49.20% 0.70 14.1 IVF-FLAT 64 - 100.00% 14.00 279.9 IVF-HNSW 64 80 100.00% 64.97 1299.4 === IVF Recall Attribution Benchmark === scenario: large-lists, n=50000, nq=50, d=64, nlist=8, avg_list=6250, k=10, metric=L2 ground truth: 0.07s build IVF-PQ: 0.59s build IVF-FLAT: 0.01s build IVF-HNSW-FLAT: 3.67s index nprobe ef recall@10 query_ms us/query --------- ------ ------ --------- -------- -------- IVF-PQ 1 - 19.00% 0.50 10.0 IVF-FLAT 1 - 100.00% 5.66 113.1 IVF-HNSW 1 80 100.00% 3.91 78.3 IVF-HNSW 1 160 100.00% 4.62 92.4 IVF-HNSW 1 320 100.00% 6.60 132.0 IVF-PQ 2 - 19.00% 0.56 11.2 IVF-FLAT 2 - 100.00% 13.08 261.6 IVF-HNSW 2 80 100.00% 5.98 119.5 IVF-HNSW 2 160 100.00% 9.38 187.6 IVF-HNSW 2 320 100.00% 13.09 261.7 IVF-PQ 4 - 19.00% 0.93 18.6 IVF-FLAT 4 - 100.00% 23.74 474.7 IVF-HNSW 4 80 100.00% 13.21 264.2 IVF-HNSW 4 160 100.00% 18.09 361.9 IVF-HNSW 4 320 100.00% 27.48 549.6 IVF-PQ 8 - 19.00% 1.17 23.4 IVF-FLAT 8 - 100.00% 35.15 703.1 IVF-HNSW 8 80 100.00% 22.50 450.0 IVF-HNSW 8 160 100.00% 33.59 671.7 IVF-HNSW 8 320 100.00% 53.23 1064.5 ## Notes This PR intentionally does not add an IVF_HNSW_FLAT disk format or Java/Python reader/writer APIs. Those should follow after the in-memory graph behavior is reviewed and the serialization layout is agreed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
