vigyasharma opened a new issue, #15612:
URL: https://github.com/apache/lucene/issues/15612
Our current HNSW vector search is great, with low latency and high recall, but it comes with its own set of challenges. Graph based algos are memory intensive, which makes them hard to scale. And they don't work well with segment merges: we have to rebuild the graph when merging segments, which is expensive and limiting.
Cluster based vector search algorithms seem to provide a promising
alternative. At a high level, they partition vectors into a bunch of clusters,
each represented by a centroid. Vectors within a partition are stored as a posting list, and queries do an exhaustive search over the nearest centroid partition(s).
This has some neat benefits:
1) Only the centroids need to be kept in memory. The postings can be loaded
from disk as needed. This scales well for larger than memory indices.
2) Clusters can be merged efficiently across segments.
3) Once you identify the nearest centroid for a query, you only need to look
into a single partition. This works well for storage / compute separation use
cases where indices live in a remote store (like serverless vector search
offerings). The "hot" in-memory part of the index is small (only centroids).
Once a nearest centroid is identified, you only need to fetch vectors for that
specific partition, which saves network cost and latency.
4) These algos do an exhaustive search within a posting, which works well
with filtering. We can order the postings on a specific field (index sort), and
skip a bunch of filtered out vectors altogether.
The SPANN paper [1] achieves strong results through two key constraints:
bounded partition size and nearest partition assignment (NPA). SPFresh [2]
builds on top of it with its 'Lightweight Incremental REbalancing' (LIRE)
approach for in-place updates. However, these papers don't map directly to
Lucene's segment based architecture. SPANN does an offline index build and
SPFresh updates the index in-place.
This issue is to brainstorm how we can adapt insights from these papers into
an effective cluster based vector search approach for Lucene's write once
segment based architecture. Sharing some initial thoughts to get the ball
rolling. Would love to hear from experts in the community.
...
### Technical Proposal
**Basic Setup:** We continue to leverage the flat vector format for vector ordinals and raw vectors. Ordinals remain unique within segment scope. Indexing simply accumulates vectors in memory and assigns them ordinals (FlatVectorsWriter). At flush, if the no. of vectors is above a threshold, we create our cluster based index. Otherwise, for a small no. of vectors, we only keep the flat format and always do an exhaustive search. This is similar to what we have for HNSW today. A rough sketch of this flush-time decision is below.
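To make the flush-time decision concrete, here's a minimal sketch. `FlatStore`, `ClusterIndexWriter`, and the threshold value are hypothetical placeholders for illustration, not existing Lucene APIs or defaults.

```java
import java.io.IOException;

// Minimal sketch of the flush-time decision described above; all names and the
// threshold value are hypothetical, not existing Lucene classes or settings.
final class FlushSketch {

  /** Raw vectors + ordinals, i.e. the existing flat vector format. */
  interface FlatStore {
    void write(float[][] vectors) throws IOException;
  }

  /** Centroids, centroid graph and postings, i.e. the proposed new format. */
  interface ClusterIndexWriter {
    void write(float[][] vectors) throws IOException;
  }

  static final int CLUSTER_INDEX_THRESHOLD = 10_000; // assumed default, tunable

  static void flush(float[][] vectors, FlatStore flat, ClusterIndexWriter clustered)
      throws IOException {
    flat.write(vectors); // always keep the flat format
    if (vectors.length < CLUSTER_INDEX_THRESHOLD) {
      return; // small no. of vectors: rely on exhaustive search at query time
    }
    clustered.write(vectors); // build the cluster based index on top
  }
}
```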
**Segment Flush:** We create the cluster based vector index during segment
flush.
1. **Clustering on Vectors:** We cluster the vectors into a configurable no.
of partitions, each represented by a centroid. Performance improves with more partitions, and the SPANN paper notes that gains plateau once centroids reach ~16% of the data points. We can keep the default `min_partitions` value at 12%-16% of the vector corpus size.
I propose we go with a simple k-means clustering approach as a first cut. SPANN does Hierarchical Balanced Clustering (HBC), which provides a better recall/latency trade-off. Similar work by PlanetScale [4] found that randomly selecting centroids works just as well when the corpus is large. Perhaps we start with k-means and follow up with more sophisticated improvements?
2. **Centroid Index:** We need an efficient data structure over the centroids to find the nearest partition at query time (and for some partition reassignment jobs during indexing). SPANN [1] uses "SPTAG", which is a mix of a KD tree and a graph pruned with RNG heuristics.
We can instead leverage Lucene's existing pieces to create an HNSW index on the centroids. This gives us an efficient way of finding the nearest partitions (centroids) at query time. Since the HNSW graph is only over the centroids, we use less than 20% of the memory that would otherwise be needed in an HNSW-only approach. This seems to have worked well for PlanetScale [4].
3. **Create Postings:** Create a posting list for the vectors assigned to each cluster. The lists can be ordered by the index sort to allow efficient filtering and skipping.
4. **Postings Expansion to Improve Recall:** Since real world data shapes
are seldom ideal, it's likely that some vectors will violate nearest partition
assignment, and get missed when a query scans only the nearest centroid
posting. The papers [1],[2] suggest a "postings expansion" step where vectors
on the outskirts of the cluster are also assigned to nearby postings to improve
recall. We apply this to our approach in the following way:
For each centroid, we find `k` nearby centroids using the HNSW graph and consider these as additional targets for the vectors in its posting. A vector is assigned to one of these centroids (partitions) if its distance from that centroid is almost the same as its distance from its nearest centroid. The small distance delta allowed will be configurable.
Additionally, we apply Relative Neighborhood Graph (RNG) pruning: if two candidate postings are closer to each other than they are to the vector, we assign the vector to only the nearer of the two, avoiding redundant assignments to nearby postings. The `k`, or no. of postings to replicate the vector in ("postings_replication_factor"?), will be configurable. The paper [1] reports good performance at `postings_replication_factor=8`. A sketch of this assignment rule follows the list below.
5. **Write Metadata:** We'll need a new vector format for this. It can reuse parts of the HNSW metadata format to store the centroid graph. Additionally, it will store the postings for each centroid.
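To make the postings expansion in step 4 concrete, here's a minimal sketch of the assignment rule, assuming squared Euclidean distance and a candidate list already sorted by distance to the vector (e.g. the `k` nearest centroids from the centroid HNSW graph). The RNG pruning here is one reading of the SPANN [1] heuristic, the distance delta is applied to squared distances for simplicity, and names like `PostingExpansionSketch` are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the posting-expansion assignment from step 4. Assumes squared
// Euclidean distance and candidates pre-sorted by distance to the vector.
// The RNG pruning is one reading of the SPANN heuristic, not a drop-in
// implementation of the paper.
final class PostingExpansionSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * Returns the indices (into {@code candidates}) of the postings this vector
   * should be replicated into. {@code candidates[0]} is the nearest centroid.
   */
  static List<Integer> assign(
      float[] vector, float[][] candidates, float epsilon, int replicationFactor) {
    List<Integer> selected = new ArrayList<>();
    float nearest = squaredDistance(vector, candidates[0]);
    for (int j = 0; j < candidates.length && selected.size() < replicationFactor; j++) {
      float dVector = squaredDistance(vector, candidates[j]);
      // Only consider centroids whose distance is "almost the same" as the nearest one.
      if (dVector > (1f + epsilon) * nearest) {
        break; // candidates are sorted, so later ones are farther still
      }
      // RNG-style pruning: skip this candidate if an already-selected centroid
      // is closer to it than the vector is, to avoid redundant replicas.
      boolean pruned = false;
      for (int i : selected) {
        if (squaredDistance(candidates[i], candidates[j]) < dVector) {
          pruned = true;
          break;
        }
      }
      if (!pruned) {
        selected.add(j);
      }
    }
    return selected;
  }
}
```

Note that the loop caps the number of assignments at the replication factor, and because the candidates are sorted it can stop as soon as one falls outside the allowed delta.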
.
**Query Flow:** For an incoming vector search query, we first find the `k` (configurable) nearest centroids using an HNSW graph lookup. Of these, we consider the nearest centroid, plus any additional centroids that are within a small distance delta of the nearest one. We fetch the postings for the selected centroid partitions and do an exhaustive scan over their vectors. Using the posting list order, we can also jump ahead in this step to skip over filtered-out documents. A rough sketch of this flow follows below.
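Here is a self-contained sketch of that query flow over plain in-memory arrays, just to pin down the control flow. In the real design the centroid lookup would go through the in-memory centroid HNSW graph and the postings would be fetched lazily from disk; all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative query flow over plain in-memory arrays; in the actual design the
// centroid lookup would go through the in-memory centroid HNSW graph and the
// postings would be fetched lazily from disk. All names here are hypothetical.
final class PartitionedSearchSketch {

  record ScoredOrd(int ord, float distance) {}

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static List<ScoredOrd> search(
      float[] query,
      float[][] centroids, // one centroid per partition
      int[][] postings,    // vector ordinals per partition (after postings expansion)
      float[][] vectors,   // raw vectors addressed by ordinal (flat format)
      int k, int nProbe, float delta) {
    Comparator<ScoredOrd> byDistance = Comparator.comparingDouble(ScoredOrd::distance);

    // 1. Find the nProbe nearest centroids. A brute-force scan stands in here
    //    for the centroid HNSW graph lookup.
    List<ScoredOrd> nearestCentroids = new ArrayList<>();
    for (int c = 0; c < centroids.length; c++) {
      nearestCentroids.add(new ScoredOrd(c, squaredDistance(query, centroids[c])));
    }
    nearestCentroids.sort(byDistance);
    nearestCentroids = nearestCentroids.subList(0, Math.min(nProbe, nearestCentroids.size()));

    // 2. Keep the nearest partition plus any partition "almost as close" to the query.
    float best = nearestCentroids.get(0).distance();
    PriorityQueue<ScoredOrd> topK = new PriorityQueue<>(byDistance.reversed()); // head = worst hit
    for (ScoredOrd c : nearestCentroids) {
      if (c.distance() > (1f + delta) * best) {
        break; // centroids are sorted, so the rest are farther still
      }
      // 3. Exhaustive scan of the selected posting. With an index sort, this is
      //    where runs of filtered-out documents could be skipped.
      for (int ord : postings[c.ord()]) {
        topK.offer(new ScoredOrd(ord, squaredDistance(query, vectors[ord])));
        if (topK.size() > k) {
          topK.poll(); // drop the current farthest hit
        }
      }
    }
    List<ScoredOrd> results = new ArrayList<>(topK);
    results.sort(byDistance);
    return results;
  }
}
```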
.
**Handling Updates and Deletes:** Deletes and updates are handled as usual
per Lucene's segment based architecture. The writes go to the latest segment
and deletes flip the `liveDocs` bits. With efficient merging, we can combine
segments frequently and keep our search performance in check.
.
**Segment Merges:** We need to adapt the cluster based approach to run
effectively with Lucene's write-once segmented architecture. I propose the
following goals for our merge design:
1) Efficiency: The approach should leverage the structure of the merging segments as much as possible. The idea is to reuse past work done in building those segments.
2) Concurrency friendly: It'll be great if our merge algorithm is easy to parallelize.
3) Correctness: We want the final merged segment to have balanced clusters, postings within the size limit, and nearest partition assignment (NPA) for vectors.
#### Potential Approach for Segment Merges
1. **Centroid Selection and Postings Merge:** We cluster the existing centroids from the input segments. To keep merges fast, we prioritize retaining the centroid of the largest participating posting. Smaller postings are appended to the largest one, combining them into a single partition for each centroid cluster. This minimizes the need to re-calculate distances for the majority of vectors in a new partition. We follow this up with a reassignment step on the vectors from the deleted centroids' postings to maintain the NPA property.
2. **Create Graph for the New Centroids:** Before we split/merge partitions or reassign vectors, we need an efficient way to identify the nearest centroids. To do this, we create an HNSW graph over the centroids remaining after the postings in each cluster were merged. We may create more centroids later if postings are split; these can simply be added to the HNSW graph.
3. **Reassign Vectors in Merged Postings:** Inspired by the SPFresh paper [2], we don't need to touch the vectors that remained in their original posting. Our candidate vectors are only the ones that were appended to the largest posting. We pick the `k` postings nearest to each centroid as "target postings" for the reassignment. This `k` is a configurable parameter called `reassign_range`. Recall improves as we increase it; SPFresh [2] suggests `reassign_range=64` works well. A sketch of this reassignment step follows the list below.
4. **Split Large Postings:** Post reassignment, we verify that each posting
is within the max configured size limit. This is important to contain tail
latency. Larger postings must be split into more partitions. The no. of split partitions is determined by the max allowed posting size, and the required centroids are computed using k-means clustering.
Vectors are then reassigned to these centroids. When assigning vectors to the new post-split centroids, we also compare their distances to the `k` (`reassign_range`) nearby postings. This ensures the vectors end up in their nearest partition, in case a neighbor was closer than the newly created centroid. All newly created centroids are also added to the centroid HNSW graph.
5. **(Optional) Merge Small Postings:** We now have a new segment with postings that are all within the max allowed partition size. Postings that are too small (`< min_partition_size`) can now be merged together, using the steps described above.
We want to ensure that we don't flip-flop endlessly between split and merge
steps. This can be handled with heuristics like only merging if the final
posting size is `< max_partition_size`.
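To illustrate the reassignment in step 3, here's a small sketch under a few assumptions: squared Euclidean distance and a precomputed list of `reassign_range` nearby partitions per partition. It's an interpretation of the step described above, not SPFresh's exact LIRE procedure, and all names are hypothetical.

```java
// Sketch of the merge-time reassignment step (3), loosely inspired by SPFresh's
// LIRE but not its exact algorithm. Only vectors appended from smaller postings
// are candidates; each is checked against the reassign_range nearest partitions
// of its current partition and moved if one of them is closer.
final class MergeReassignSketch {

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * @param candidateOrds    ordinals of vectors that were appended from smaller postings
   * @param vectors          raw vectors addressed by ordinal
   * @param assignments      current partition per ordinal, updated in place
   * @param centroids        centroids of all partitions after the postings merge
   * @param nearbyPartitions for each partition, its reassign_range nearest partitions
   *                         (e.g. looked up via the centroid HNSW graph)
   */
  static void reassign(
      int[] candidateOrds,
      float[][] vectors,
      int[] assignments,
      float[][] centroids,
      int[][] nearbyPartitions) {
    for (int ord : candidateOrds) {
      int current = assignments[ord];
      float bestDist = squaredDistance(vectors[ord], centroids[current]);
      int best = current;
      // Bounded work per candidate: only reassign_range nearby partitions are checked.
      for (int p : nearbyPartitions[current]) {
        float d = squaredDistance(vectors[ord], centroids[p]);
        if (d < bestDist) {
          bestDist = d;
          best = p;
        }
      }
      assignments[ord] = best; // restores nearest-partition assignment (NPA) for the candidate
    }
  }
}
```

The same helper could plausibly be reused after the split step (4), with `nearbyPartitions` extended to include the newly created post-split centroids.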
...
This outlines my initial thoughts for merging segments. I suspect with
benchmarking, we can find incremental improvements, like heuristics that
contain the no. of vectors considered for reassignment. I've intentionally
skipped details on how to organize and fetch postings from disk, hoping
existing Lucene primitives can help us here.
Looking forward to hearing from experts in the community who've thought
about or tried similar approaches elsewhere.
**References:**
[1] [SPANN Paper](https://arxiv.org/pdf/2111.08566)
[2] [SPFresh Paper](https://arxiv.org/pdf/2410.14452)
[3] [An older RFC with a SPANN specific
proposal](https://github.com/apache/lucene/issues/14997)
[4] [PlanetScale Blog about a similarly inspired
setup](https://planetscale.com/blog/larger-than-ram-vector-indexes-for-relational-databases)