[ https://issues.apache.org/jira/browse/SOLR-14397 ]
Steven Chou deleted comment on SOLR-14397:
------------------------------------
was (Author: sing10407):
I would like to propose a new approach:
1. Leverage the [SuperBit|https://github.com/tdebatty/java-LSH] hash to
reduce the vector from a double array to a boolean array
* It preserves similarity. [According to
this.|https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/examples/SuperBitExample.java#L58]
* We should leverage DocValues
2. According to the [new similarity scoring
function|https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/SuperBit.java#L197],
the original implementation is for comparison with cosine similarity. So if we
want to leverage this for new scoring and keep the documents in order, we can
simplify the algorithm:
# Reduce the vectors to a boolean array via SuperBit. (Stored as a binary
array in DocValues)
# Count the number of positions with identical bits as the score.
Simplify:
# Use [BitSet|https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html]
in Java, converting both binary vectors into BitSets.
## Use the bitwise "BitSet.xor()" operation; the positions where the bits
differ become true.
## Use
"BitSet.[cardinality|https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#cardinality--]()"
to count the differing bits; the score is the vector length minus that count.
## Done.
# The BitSet is memory efficient, [see this
discussion|https://stackoverflow.com/questions/605226/boolean-vs-bitset-which-is-more-efficient].
3. Additional:
# Filter on similarity score > 0.8 (in this case, identical-bit count > k) to
keep recall and precision at a good level. If SuperBit preserves similarity
well, recall and precision will be high.
Note: *this approach computes the score against all documents.* Since both the
calculation cost and the memory consumption are low, we can leverage the
field attribute _docValuesFormat="Memory"_. Furthermore, the field should use
BinaryDocValues, and all vectors should be hashed by SuperBit and converted
into binary format so they can be stored as BinaryDocValues.
ref: A Revisit of Hashing Algorithms for Approximate
Nearest Neighbor Search https://arxiv.org/pdf/1612.07545.pdf
> Vector Search in Solr
> ---------------------
>
> Key: SOLR-14397
> URL: https://issues.apache.org/jira/browse/SOLR-14397
> Project: Solr
> Issue Type: Improvement
> Reporter: Trey Grainger
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Search engines have traditionally relied upon token-based matching (typically
> keywords) on an inverted index, plus relevance ranking based upon keyword
> occurrence statistics. This can be viewed as a “sparse vector” match (where
> each term is a one-hot encoded dimension in the vector), since only a few
> keywords out of all possible keywords are considered in each query. With the
> introduction of deep-learning-based transformers over the last few years,
> however, the state of the art in relevance has moved to ranking models based
> upon dense vectors that encode a latent, semantic understanding of both
> language constructs and the underlying domain upon which the model was
> trained. These dense vectors are also referred to as “embeddings”. An example
> of this kind of embedding would be taking the phrase “chief executive officer
> of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other
> similar phrases should encode to vectors with very similar numbers,
> so we may expect a query like “CEO of a technology org” to generate a vector
> like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity
> calculation between these vectors, we would expect a number closer to 1.0,
> whereas a very unrelated text blurb would generate a much smaller cosine
> similarity.
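As a quick illustration of the similarity calculation described above, the cosine between the two example vectors can be computed directly (a minimal sketch; the class name is illustrative):

```java
public class CosineExample {

    // Plain cosine similarity: dot(a, b) / (|a| * |b|).
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // "chief executive officer of the tech company"
        double[] doc = {0.03, 1.7, 9.12, 0, 0.3};
        // "CEO of a technology org"
        double[] query = {0.1, 1.9, 8.9, 0.1, 0.4};
        // Prints a value close to 1.0 (approximately 0.9995)
        System.out.printf("cosine = %.4f%n", cosine(doc, query));
    }
}
```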
> This is a proposal for how we should implement these vector search
> capabilities in Solr.
> h1. Search Process Overview:
> In order to implement dense vector search, the following process is typically
> followed:
> h2. Offline:
> An encoder is built. An encoder can take in text (a query, a sentence, a
> paragraph, a document, etc.) and return a dense vector representing that
> document in a rich semantic space. The semantic space is learned from
> training on textual data (usually, though other sources work, too), typically
> from the domain of the search engine.
> h2. Document Ingestion:
> When documents are processed, they are passed to the encoder, and the dense
> vector(s) returned are stored as fields on the document. There could be one
> or more vectors per-document, as the granularity of the vectors could be
> per-document, per field, per paragraph, per-sentence, or even per phrase or
> per term.
> h2. Query Time:
> *Encoding:* The query is translated to a dense vector by passing it to the
> encoder
> *Quantization:* The query is quantized. Quantization is the process of taking
> a vector with many values and turning it into “terms” in a vector space that
> approximates the full vector space of the dense vectors.
> *ANN Matching:* A query on the quantized vector tokens is executed as an ANN
> (approximate nearest neighbor) search. This allows finding most of the best
> matching documents (typically up to 95%) with a traditional and efficient
> lookup against the inverted index.
> _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched
> quantized tokens to get a rough, initial ranking of documents based upon the
> similarity of the query and document vectors. This allows the next step
> (re-ranking) to be performed on a smaller subset of documents.
> *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is
> performed, a similarity calculation (cosine, dot-product, or any number of
> other calculations) is typically performed between the full (non-quantized)
> dense vectors for the query and those in the document. This re-ranking will
> typically be on the top-N results for performance reasons.
> *Return Results:* As with any search, the final step is typically to return
> the results in relevance-ranked order. In this case, that would be sorted by
> the re-ranking similarity score (i.e. “cosine descending”).
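The quantization step in this pipeline could be sketched, for example, as sign-based random projections (a hedged illustration only, not the proposed implementation; the class name and the "i_bit" token format are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of quantization: project the dense vector onto k
// random hyperplanes and emit one "term" per hyperplane sign, so the ANN
// match can run as a normal inverted-index lookup over those terms.
public class SignQuantizer {
    private final double[][] hyperplanes;  // k x dim

    public SignQuantizer(int k, int dim, long seed) {
        Random rnd = new Random(seed);  // fixed seed: index and query sides must agree
        hyperplanes = new double[k][dim];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < dim; j++) {
                hyperplanes[i][j] = rnd.nextGaussian();
            }
        }
    }

    // Each token encodes which side of hyperplane i the vector falls on,
    // e.g. "3_1" or "3_0"; similar vectors share most of their tokens.
    public List<String> quantize(double[] v) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < v.length; j++) {
                dot += hyperplanes[i][j] * v[j];
            }
            tokens.add(i + "_" + (dot >= 0 ? "1" : "0"));
        }
        return tokens;
    }
}
```

Matching on some minimum number of shared tokens then approximates "nearby in the vector space" using only ordinary term lookups.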
> ------
> *Variant:* For small document sets, it may be preferable to rank all
> documents and skip steps 2, 3, and 4. This is because ANN Matching
> typically reduces recall (current state of the art is around 95% recall), so
> it can be beneficial to rank all documents if performance is not a concern.
> In this case, step 5 is performed on the full doc set and would obviously
> just be considered “ranking” instead of “re-ranking”.
> h1. Proposed Implementation in Solr:
> h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
> h3. Dense Vector Field:
> We will add a new dense vector field type in Solr. This field type would be a
> compressed encoding of a dense vector into a BinaryDocValues Field. There are
> other ways to do it, but this is almost certain to be the most efficient.
> Ideally this field is multi-valued. If it is single-valued then we are
> either limited to only document-level vectors, or otherwise we have to create
> many vector fields (i.e. per paragraph or term) and search across them, which
> will never be practical or scale well. BinaryDocValues does not natively
> support multiple values, but since the Solr field type will need to be
> responsible for encoding the numeric vectors as a byte[] it should also be
> able to encode multiple vectors at the same time.
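The multi-vector byte[] encoding described above might look like the following (a hypothetical sketch; the codec layout, class, and method names are assumptions, not a committed design):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of packing multiple float vectors into one byte[]
// for a BinaryDocValues field: a vector count, then per-vector length
// prefixes, so a decoder can stop after the 'first' vector when that is
// all the distance function needs.
public class VectorCodec {

    public static byte[] encode(float[][] vectors) {
        int size = 4;  // vector count
        for (float[] v : vectors) {
            size += 4 + 4 * v.length;  // length prefix + float values
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(vectors.length);
        for (float[] v : vectors) {
            buf.putInt(v.length);
            for (float f : v) {
                buf.putFloat(f);
            }
        }
        return buf.array();
    }

    // Decode at most 'max' vectors, leaving the remainder untouched.
    public static float[][] decode(byte[] bytes, int max) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = Math.min(buf.getInt(), max);
        float[][] vectors = new float[count][];
        for (int i = 0; i < count; i++) {
            float[] v = new float[buf.getInt()];
            for (int j = 0; j < v.length; j++) {
                v[j] = buf.getFloat();
            }
            vectors[i] = v;
        }
        return vectors;
    }
}
```

A real implementation would likely add compression (e.g. reduced-precision floats) on top of this layout; the sketch only shows the multi-value framing.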
> h3. Vector Similarity Value Sources:
> Implement function queries (value sources) that take in a query vector, a
> (multi-valued) vector field name, and a multi-value selector ("max", "min",
> "avg", "first"), and return the computed calculation. This function could
> then be used to score documents via the “func” query parser. ANN queries
> could be implemented using separate “fq” params for filtering, or as part of
> the “q” query parameter to support optional ANN Ranking, and then the
> existing Solr re-ranking functionality could be used for re-ranking the top N
> results using the “func” query parser and these new cosine or dot product
> value sources as the re-rank query. Existing Solr logic for performing
> two-phase iteration will ensure that these vector functions will not be
> excessively computed for documents that do not already match the “cheaper”
> ANN queries.
> *Example*:
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first){noformat}
> ^ simple, no-ANN. Third param is optional and is the selector function
> (min|max|avg|first|last). I think this should default to "max" (or whatever
> "most related" means for each similarity function; it's "max" for cosine and
> dot product).
> *Example:*
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field,
> first)&fq=[ANN_QUERY]{noformat}
> ^ ensures function is not computed unless ANN query matches due to two-phase
> iterator
> *Example:*
> {noformat}
> q=[ANN_QUERY]&rq={!rerank reRankQuery=$rqq reRankDocs=10000
> reRankWeight=1000000}&rqq={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33",
> vectors_field){noformat}
> ^Supports an initial ranking based upon the ANN QUERY, followed by a
> re-ranking based upon the cosine function.
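The multi-value selector semantics used in the examples above can be sketched as follows (a minimal sketch; the class name is illustrative, and it assumes the per-vector similarity scores for a document have already been computed):

```java
// Hypothetical sketch of the multi-value selector: given the per-document
// similarity scores (one per stored vector), reduce them to a single score
// according to the selector named in the function query.
public class MultiValueSelector {

    public static double select(double[] scores, String selector) {
        double result;
        switch (selector) {
            case "first":
                result = scores[0];
                break;
            case "min":
                result = scores[0];
                for (double s : scores) result = Math.min(result, s);
                break;
            case "avg":
                result = 0;
                for (double s : scores) result += s;
                result /= scores.length;
                break;
            case "max":  // proposed default: "most related" for cosine/dot product
            default:
                result = scores[0];
                for (double s : scores) result = Math.max(result, s);
                break;
        }
        return result;
    }
}
```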
> The key implementation challenges are expected to be:
> * ensuring decent query (speed) performance
> * Determining an efficient vector/byte encoding that has decent
> disk-space/query time decoding trade offs and makes it possible to decode
> only the ‘first’ vector per doc in cases where that is all that’s requested
> by the distance function
> h2. Phase 2: Pluggable ANN Search
> * Implement pluggable quantization/ANN search. There are multiple different
> algorithms to do quantization for efficient ANN search. The simplest is
> something like LSH, with more advanced implementations being worked on in
> Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a
> prototyped LSH implementation (Solr Vector Search plugin) and a random
> projections approach (Hangry) linked to in SOLR-12890.
> * Given that the transformers/encoders are a rapidly evolving area of data
> science and NLP research, it feels unreasonable to assume that Solr will be
> able to keep up to date on the latest model integrations. Given that, if we
> build support into Solr for encoder models, we need to make it very pluggable.
> * Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus
> for external systems to accomplish quantization by creating quantized terms
> externally, indexing them to Solr, and then sending in queries to Solr with
> the quantized terms. I’m not convinced that keeping quantization outside of
> Lucene/Solr isn’t a more manageable model long term, but if certain models do
> get integrated natively into Lucene (as is currently being worked on) then we
> should obviously look to leverage them.
> h2. Important Considerations:
> * Multi-valued dense vectors are a key differentiating feature here. I’m
> considering this a key requirement and designing accordingly. This will allow
> for multiple embeddings per document (i.e. word embeddings, sentence
> embeddings, paragraph embeddings, etc.) to be modeled. I like the way
> [~ehatcher] designed the Payload Function Query and think we should model
> the multi-valued handling similarly.
> * As best as possible, it would be nice to support multiple (pluggable)
> fields/value sources that can be converted to dense vectors and compared. At
> a minimum we need an efficient dense vector field, but it may be beneficial
> to support a sparse vector field where only meaningful values need to be
> encoded, in order to save space. Additionally, there may be some utility in
> other field types or arbitrary value sources that can produce vector-like
> output to be leveraged in the function queries being proposed.
> h1. Note on Prior Approaches
> * There are a handful of ways to perform vector search today with Solr. One
> is through Streaming Expressions, and alternatively there are a few plugins
> which have implemented this functionality for Solr. One of these was
> contributed in SOLR-12890, and I've outlined examples of most of these
> approaches in the comments on SOLR-12890.
> * The Streaming Expressions approach works well out of the box, but doesn't
> integrate nicely with traditional keyword search use cases, so we need a
> solution that integrates with real-time queries and enables the full suite of
> Solr's query-time features.
> * The most advanced Solr plugin out there seems to be this one:
> https://github.com/moshebla/solr-vector-scoring (forked the version in
> SOLR-12890 and then added LSH functionality for ANN).
> * I encountered several challenges when working with the implementation in
> SOLR-12890 which informed the above proposed design:
> 1. The plugin encodes dense vectors into payloads attached to terms
> representing vector positions. This is pretty inefficient and is way too slow
> at significant scale. I believe we instead need the underlying field types to
> be binary doc values field with efficient compression and decoding for
> optimal performance when scanning through all the matching documents and
> dimensions.
> 2. The plugin only supports one vector per-field, making supporting
> word/phrase vectors, sentence vectors, paragraph vectors, etc. challenging.
> It's not obvious how to modify the current approach to support multiple
> vectors per field.
> 3. The plugin uses a query parser instead of just a function query, which
> prevents re-use of the vector similarity function. Using a function query
> instead will enable the vector similarity scores to be easily leveraged for
> relevance scoring, sorting, returned as fields, or used within other function
> queries. Given that there are two goals of 1) filtering on ANN, and 2)
> Scoring on vector similarity (usually via re-ranking), using a function query
> for the scoring piece will be more flexible across scoring use cases
> (combining with other functions, returning, sorting, etc.)
> 4. The LSH quantization/ANN implementation is a cool idea, but it is
> hard-coded in the implementation. I recommend making it a separate filter and
> making the implementation more pluggable (though for now this can be
> externalized and passed in to Solr). We may be able to pull the LSH work in
> during phase 2.
> For cleaner discussion and tracking, I'm moving this revised proposal to this
> new JIRA, and we can use SOLR-12890 for continued discussion (if any) of the
> previously contributed plugin implementation.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)