vigyasharma opened a new issue, #15612:
URL: https://github.com/apache/lucene/issues/15612

   Our current HNSW vector search delivers low latency and high recall, but 
comes with its own set of challenges. Graph based algos are memory intensive, 
which makes them hard to scale, and they don't play well with segment merges: 
we have to rebuild the graph when merging segments, which is expensive and 
limiting.
   
   Cluster based vector search algorithms seem to provide a promising 
alternative. At a high level, they partition vectors into a bunch of clusters, 
each represented by a centroid. Vectors within each partition are stored as a 
posting list, and queries do an exhaustive search over the nearest centroid 
partition(s). This has some neat benefits:
   1) Only the centroids need to be kept in memory. The postings can be loaded 
from disk as needed. This scales well for larger-than-memory indices.
   2) Clusters can be merged efficiently across segments.
   3) Once you identify the nearest centroid for a query, you only need to look 
into a single partition. This works well for storage / compute separation use 
cases where indices live in a remote store (like serverless vector search 
offerings). The "hot" in-memory part of the index is small (only centroids). 
Once a nearest centroid is identified, you only need to fetch vectors for that 
specific partition, which saves network cost and reduces latency.
   4) These algos do an exhaustive search within a posting, which works well 
with filtering. We can order the postings on a specific field (index sort) and 
skip a bunch of filtered-out vectors altogether.
   
   The SPANN paper [1] achieves strong results through two key constraints: 
bounded partition size and nearest partition assignment (NPA). SPFresh [2] 
builds on top of it with its 'Lightweight Incremental REbalancing' (LIRE) 
approach for in-place updates. However, these papers don't map directly to 
Lucene's segment based architecture. SPANN does an offline index build and 
SPFresh updates the index in-place.
   
   This issue is to brainstorm how we can adapt insights from these papers into 
an effective cluster based vector search approach for Lucene's write once 
segment based architecture. Sharing some initial thoughts to get the ball 
rolling. Would love to hear from experts in the community.
   
   ...
   
   ### Technical Proposal
   
   **Basic Setup:** We continue to leverage the flat vector format for vector 
ordinals and raw vectors. Ordinals continue to be unique within segment scope. 
Indexing simply accumulates vectors in memory and assigns them ordinals 
(FlatVectorsWriter). At flush, if the no. of vectors is above a threshold, we 
create our cluster based index. Otherwise, for a small no. of vectors, we keep 
only the flat format and always do an exhaustive search. This is similar to 
what we have for HNSW today.
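
   A minimal sketch of that flush-time decision is below. The writer shape and 
the `clusterIndexThreshold` knob are hypothetical illustrations, not existing 
Lucene APIs:

   ```java
/**
 * Hypothetical flush-time decision: below the threshold we keep only the flat
 * format and rely on exhaustive search; above it we also build the cluster
 * based index. This mirrors how small segments are handled for HNSW today.
 */
final class ClusteredVectorsFlushSketch {
  private final int clusterIndexThreshold; // hypothetical knob

  ClusteredVectorsFlushSketch(int clusterIndexThreshold) {
    this.clusterIndexThreshold = clusterIndexThreshold;
  }

  void flush(int numVectorsInSegment) {
    writeFlatVectors(); // ordinals + raw vectors are always written
    if (numVectorsInSegment >= clusterIndexThreshold) {
      buildClusteredIndex(); // clustering, centroid graph, postings, metadata
    }
  }

  private void writeFlatVectors() { /* delegate to the flat vectors writer */ }

  private void buildClusteredIndex() { /* see the segment flush steps below */ }
}
   ```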
   
   **Segment Flush:** We create the cluster based vector index during segment 
flush.
   1. **Clustering on Vectors:** We cluster the vectors into a configurable no. 
of partitions, each represented by a centroid. Performance improves with more 
partitions, and the SPANN paper notes that gains plateau once the number of 
centroids reaches ~16% of the data points. We can keep the default 
`min_partitions` value at 12% - 16% of the vector corpus size.
   
       I propose we go with a simple k-means clustering approach as a first 
cut. SPANN uses Hierarchical Balanced Clustering (HBC), which provides better 
recall for a given latency. Similar work by PlanetScale [4] found that randomly 
selecting centroids works just as well when the corpus is large. Perhaps we 
start with k-means and follow up with more sophisticated improvements? (A rough 
sketch of the clustering and postings expansion steps follows this list.)
       
   2. **Centroid Index:** We need an efficient data structure on the centroids 
to find the nearest partition at query time (and to support partition 
reassignment during indexing). SPANN [1] uses SPTAG, which is a mix of a 
KD-tree and a graph pruned with RNG heuristics. We can instead leverage 
Lucene's existing pieces to create an HNSW index on the centroids. This gives 
an efficient way of finding the nearest partitions (centroids) at query time. 
Since the HNSW graph is only on the centroids, we use < 20% of the memory that 
would otherwise be needed in an HNSW-only approach. This seems to have worked 
well for PlanetScale [4].
      
   3. **Create Postings:** Create posting lists for the vectors assigned to 
each cluster. The lists can be ordered by the index sort to allow efficient 
filtering and skipping. 
      
   4. **Postings Expansion to Improve Recall:** Since real world data shapes 
are seldom ideal, it's likely that some vectors will violate nearest partition 
assignment and get missed when a query scans only the nearest centroid's 
posting. The papers [1],[2] suggest a "postings expansion" step where vectors 
on the outskirts of a cluster are also assigned to nearby postings to improve 
recall. We apply this to our approach in the following way:

       For each centroid, we find `k` nearby centroids using the HNSW graph and 
consider these as additional targets for vectors in the posting. A vector is 
assigned to one of these centroids (partitions) if its distance from that 
centroid is almost the same as its distance from the nearest centroid. The 
allowed distance delta will be configurable. 

       Additionally, we apply Relative Neighborhood Graph (RNG) pruning: if 
posting A is closer to posting B than it is to the vector, we only assign the 
vector to posting B, avoiding redundant assignments to nearby postings. The 
`k`, i.e. the no. of postings to replicate the vector into 
("postings_replication_factor"?), will be configurable. The paper [1] reports 
good performance at `postings_replication_factor=8`. (See the sketch after this 
list for how the delta and RNG checks could fit together.)
      
   5. **Write Metadata:** We'll need a new vector format for this. It can reuse 
parts of the HNSW metadata format to store the centroid graph. Additionally, it 
will store the postings for each centroid.
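
   To make the clustering (step 1) and postings expansion (step 4) concrete, 
here is a self-contained sketch in plain Java. It deliberately avoids Lucene 
APIs: the parameter names (`maxDistanceDelta`, `replicationFactor`) and the 
brute-force centroid sort (which an HNSW search over the centroids would 
replace) are illustrative assumptions, and distances are squared L2 for 
simplicity.

   ```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Illustrative sketch of flush-time clustering (step 1) and postings expansion (step 4). */
final class ClusteredFlushSketch {

  /** Step 1: plain k-means (Lloyd's iterations) over the flushed vectors (assumed non-empty). */
  static float[][] kMeans(float[][] vectors, int numCentroids, int iterations, long seed) {
    Random random = new Random(seed);
    float[][] centroids = new float[numCentroids][];
    for (int c = 0; c < numCentroids; c++) {
      centroids[c] = vectors[random.nextInt(vectors.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      float[][] sums = new float[numCentroids][vectors[0].length];
      int[] counts = new int[numCentroids];
      for (float[] v : vectors) {
        int nearest = nearestCentroid(v, centroids);
        counts[nearest]++;
        for (int d = 0; d < v.length; d++) {
          sums[nearest][d] += v[d];
        }
      }
      for (int c = 0; c < numCentroids; c++) {
        if (counts[c] == 0) continue; // keep the old centroid for empty clusters
        for (int d = 0; d < sums[c].length; d++) {
          centroids[c][d] = sums[c][d] / counts[c];
        }
      }
    }
    return centroids;
  }

  /**
   * Step 4: a vector is replicated into a nearby posting if its (squared) distance to that
   * centroid is within (1 + maxDistanceDelta) of its distance to the nearest centroid,
   * subject to RNG pruning: skip a candidate posting if an already selected posting is
   * closer to it than the vector is.
   */
  static List<Integer> expandedAssignments(
      float[] vector, float[][] centroids, int replicationFactor, float maxDistanceDelta) {
    // Candidates would normally come from an HNSW search over the centroids;
    // here we just sort all centroids by distance to the vector for clarity.
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(c -> squaredL2(vector, centroids[c])));

    List<Integer> selected = new ArrayList<>();
    float nearestDist = squaredL2(vector, centroids[order[0]]);
    for (int c : order) {
      if (selected.size() >= replicationFactor) break;
      float dist = squaredL2(vector, centroids[c]);
      if (dist > nearestDist * (1 + maxDistanceDelta)) break; // too far beyond the nearest
      boolean pruned = false;
      for (int kept : selected) {
        if (squaredL2(centroids[kept], centroids[c]) < dist) { // RNG rule
          pruned = true;
          break;
        }
      }
      if (!pruned) selected.add(c);
    }
    return selected;
  }

  static int nearestCentroid(float[] v, float[][] centroids) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = squaredL2(v, centroids[c]);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }

  static float squaredL2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
   ```

   For a first cut, the number of centroids could be derived from 
`min_partitions` as a fraction of the segment's vector count, and the 
assignments above become the per-centroid posting lists.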
   
   .
   
   **Query Flow:** For an incoming vector search query, we first find the `k` 
(configurable) nearest centroids using an HNSW graph lookup. Of these, we keep 
the nearest centroid and any additional centroids that are within a small 
distance delta of it. We then fetch the postings for the selected centroid 
partitions and do an exhaustive scan over their vectors. Using the posting list 
order, we can also jump ahead in this step to skip over filtered-out documents.
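
   Roughly, in code (a sketch only; `CentroidIndex` and `PartitionScorer` are 
hypothetical interfaces standing in for the HNSW-on-centroids lookup and the 
on-disk posting scan, and the knobs are placeholders, not existing Lucene 
APIs):

   ```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Sketch of the query flow over a cluster based vector index. */
final class ClusteredQuerySketch {

  interface CentroidIndex {
    /** Ordinals of the k nearest centroids, nearest first (e.g. via an HNSW search). */
    int[] nearestCentroids(float[] query, int k);

    /** Distance between the query and a centroid. */
    float distance(float[] query, int centroidOrd);
  }

  interface PartitionScorer {
    /** Exhaustively score one partition's posting, skipping docs rejected by the filter. */
    void scorePartition(int centroidOrd, float[] query, IntPredicate acceptDoc);
  }

  static void search(
      float[] query,
      CentroidIndex centroids,
      PartitionScorer scorer,
      IntPredicate acceptDoc,
      int numCentroidCandidates, // hypothetical knob: how many centroids the HNSW lookup returns
      float maxDistanceDelta) {  // hypothetical knob: how far beyond the nearest centroid we go
    int[] candidates = centroids.nearestCentroids(query, numCentroidCandidates);
    float nearestDist = centroids.distance(query, candidates[0]);

    // Keep the nearest centroid plus any others within a small distance delta of it.
    List<Integer> selected = new ArrayList<>();
    for (int ord : candidates) {
      if (centroids.distance(query, ord) <= nearestDist * (1 + maxDistanceDelta)) {
        selected.add(ord);
      }
    }

    // Exhaustive scan over the selected partitions; the posting's index-sort order
    // lets the scorer jump ahead over filtered-out documents.
    for (int ord : selected) {
      scorer.scorePartition(ord, query, acceptDoc);
    }
  }
}
   ```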
   
   .
   
   **Handling Updates and Deletes:** Deletes and updates are handled as usual 
per Lucene's segment based architecture. The writes go to the latest segment 
and deletes flip the `liveDocs` bits. With efficient merging, we can combine 
segments frequently and keep our search performance in check.
   
   .
   
   **Segment Merges:** We need to adapt the cluster based approach to run 
effectively with Lucene's write-once segmented architecture. I propose the 
following goals for our merge design:
   1) Efficiency: The approach should leverage the structure of the merging 
segments as much as possible. The idea is to reuse past work done in building 
those segments.
   2) Concurrency friendly: It'll be great if our merge algorithm is easy to 
parallelize. 
   3) Correctness: We want the final merged segment to have balanced clusters, 
postings within the size limit, and nearest partition assignment (NPA) for all 
vectors.
   
   #### Potential Approach for Segment Merges
   
   1. **Centroid Selection and Postings Merge:** We cluster the existing 
centroids from the input segments. To keep merges fast, we prioritize retaining 
the centroid of the largest participating posting. Smaller postings are 
appended to the largest one, combining them into a single partition for each 
centroid cluster. This avoids re-calculating distances for the majority of 
vectors in the new partition. We follow this up with a reassignment step on 
vectors from the deleted centroids' postings to maintain the NPA property.
   
   2. **Create Graph for the New Centroids:** Before we split/merge partitions 
or reassign vectors, we need an efficient way to identify the nearest 
centroids. To do this, we create an HNSW graph over the centroids that remain 
after the postings in each cluster are merged. We may create more centroids 
later if postings are split; these can simply be added to the HNSW graph.
   
   3. **Reassign Vectors in Merged Postings:** Inspired by the SPFresh paper 
[2], we don't need to touch the vectors that remained in their original 
posting. Our "candidate vectors" are only the ones that were appended to the 
largest posting. We pick the `k` postings nearest to each centroid as "target 
postings" for the reassignment. This `k` is a configurable parameter called 
`reassign_range`. Recall improves as we increase it; SPFresh [2] suggests 
`reassign_range=64` works well. (See the sketch after this list.)
      
   4. **Split Large Postings:** Post reassignment, we verify that each posting 
is within the max configured size limit. This is important to contain tail 
latency. Larger postings must be split into more partitions. The no. of split 
partitions is determined from the max allowed posting size, and the required 
centroids are computed using k-means clustering. 

       Vectors are then reassigned to these centroids. When assigning vectors 
to the new post-split centroids, we also compare their distance to the `k` 
(`reassign_range`) nearby postings. This ensures the vectors are assigned to 
their nearest partition, in case a neighboring centroid is closer than the 
newly created one. All newly created centroids are also added to the centroid 
HNSW graph.
      
   5. **(Optional) Merge Small Postings:** We now have a new segment with 
postings that are all within the max allowed partition size. Postings that are 
too small (`< min_partition_size`) can now be merged together into larger 
postings, using the steps described above. We want to ensure that we don't 
flip-flop endlessly between the split and merge steps. This can be handled with 
heuristics like only merging if the combined posting size is 
`< max_partition_size`.
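
   A small sketch of the reassignment check (step 3) and split sizing (step 4) 
is below. The `reassign_range` and `max_partition_size` knobs come from the 
text above; `squaredL2` stands in for the field's configured similarity, and 
the nearby centroids would come from the centroid HNSW graph rather than being 
passed in:

   ```java
/** Illustrative sketch of merge-time reassignment (step 3) and split sizing (step 4). */
final class MergeRebalanceSketch {

  /**
   * Step 3: a candidate vector (one that was appended to the surviving posting) moves to
   * whichever of the `reassign_range` nearby centroids is closest. Vectors that stayed in
   * their original posting are not touched at all.
   */
  static int reassign(
      float[] vector, int currentCentroid, int[] nearbyCentroids, float[][] centroidVectors) {
    int best = currentCentroid;
    float bestDist = squaredL2(vector, centroidVectors[currentCentroid]);
    for (int ord : nearbyCentroids) { // the `reassign_range` target postings
      float d = squaredL2(vector, centroidVectors[ord]);
      if (d < bestDist) {
        bestDist = d;
        best = ord;
      }
    }
    return best; // may simply be the current centroid, i.e. no move needed
  }

  /** Step 4: how many partitions a too-large posting should be split into. */
  static int numSplitPartitions(int postingSize, int maxPartitionSize) {
    return Math.toIntExact((postingSize + maxPartitionSize - 1L) / maxPartitionSize);
  }

  static float squaredL2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
   ```

   The k-means sketch from the flush section could then pick the new centroids 
for each split, and the post-split assignment reuses the same 
nearest-of-`reassign_range` check.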
   
   ...
   
   This outlines my initial thoughts on merging segments. I suspect that, with 
benchmarking, we can find incremental improvements, like heuristics that limit 
the no. of vectors considered for reassignment. I've intentionally skipped 
details on how to organize and fetch postings from disk, hoping existing Lucene 
primitives can help us here.
   
   Looking forward to hearing from experts in the community who've thought 
about or tried similar approaches elsewhere.
   
   **References:**
   [1] [SPANN Paper](https://arxiv.org/pdf/2111.08566)
   [2] [SPFresh Paper](https://arxiv.org/pdf/2410.14452)
   [3] [An older RFC with a SPANN specific 
proposal](https://github.com/apache/lucene/issues/14997)
   [4] [PlanetScale Blog about a similarly inspired 
setup](https://planetscale.com/blog/larger-than-ram-vector-indexes-for-relational-databases)
   

