[ https://issues.apache.org/jira/browse/SOLR-14397 ]
Steven Chou deleted comment on SOLR-14397:
------------------------------------
was (Author: sing10407):
I would like to propose a new approach:
1. Leverage the [SuperBit|https://github.com/tdebatty/java-LSH] hash to
reduce the vector from a double array to a boolean array
* It preserves similarity. [According to
this.|https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/examples/SuperBitExample.java#L58]
* We should leverage DocValues
2. According to the [new similarity scoring
function|https://github.com/tdebatty/java-LSH/blob/master/src/main/java/info/debatty/java/lsh/SuperBit.java#L197],
the original implementation is for comparison with cosine similarity. So if we
want to leverage this for new scoring and keep the documents in order, we can
simplify the algorithm:
# Reduce the vectors to a boolean array via SuperBit. (Stored as a binary
array in DocValues)
# Count the number of positions with identical bits as the score.
Simplify:
# Use [BitSet|https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html]
in Java, converting both binary vectors into BitSets.
## Use the bitwise "BitSet.xor()" operation; the positions where the bits
differ become true.
## Use
"BitSet.[cardinality|https://docs.oracle.com/javase/8/docs/api/java/util/BitSet.html#cardinality--]()"
to count the differing bits; the score is the vector length minus that count.
## Done.
# The BitSet is memory efficient, [see this
discussion|https://stackoverflow.com/questions/605226/boolean-vs-bitset-which-is-more-efficient].
3. Additional:
# Filter on similarity score > 0.8 (in this case, identical-bit count > k) to
keep recall and precision at a good level. If SuperBit preserves similarity
well, recall and precision will be high.
Note: *this approach computes the score against all documents.* Since both the
calculation cost and the memory consumption are low, we can leverage the
field attribute _docValuesFormat="Memory"_. Furthermore, the field should use
BinaryDocValues, and all vectors should be hashed by SuperBit and converted
into binary format so they can be stored as BinaryDocValues.
ref: A Revisit of Hashing Algorithms for Approximate
Nearest Neighbor Search https://arxiv.org/pdf/1612.07545.pdf
> Vector Search in Solr
> ---------------------
>
> Key: SOLR-14397
> URL: https://issues.apache.org/jira/browse/SOLR-14397
> Project: Solr
> Issue Type: Improvement
> Reporter: Trey Grainger
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Search engines have traditionally relied upon token-based matching (typically
> keywords) on an inverted index, plus relevance ranking based upon keyword
> occurrence statistics. This can be viewed as a “sparse vector” match (where
> each term is a one-hot encoded dimension in the vector), since only a few
> keywords out of all possible keywords are considered in each query. With the
> introduction of deep-learning-based transformers over the last few years,
> however, the state of the art in relevance has moved to ranking models based
> upon dense vectors that encode a latent, semantic understanding of both
> language constructs and the underlying domain upon which the model was
> trained. These dense vectors are also referred to as “embeddings”. An example
> of this kind of embedding would be taking the phrase “chief executive officer
> of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other
> similar phrases should encode to vectors with very similar numbers,
> so we may expect a query like “CEO of a technology org” to generate a vector
> like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity
> calculation between these vectors, we would expect a number closer to 1.0,
> whereas a very unrelated text blurb would generate a much smaller cosine
> similarity.
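As a quick illustration of the similarity calculation described above, the cosine between the two example vectors can be computed directly (a minimal sketch; the class name is illustrative):

```java
public class CosineExample {

    // Plain cosine similarity: dot(a, b) / (|a| * |b|).
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // "chief executive officer of the tech company"
        double[] doc = {0.03, 1.7, 9.12, 0, 0.3};
        // "CEO of a technology org"
        double[] query = {0.1, 1.9, 8.9, 0.1, 0.4};
        // Prints a value close to 1.0 (approximately 0.9995)
        System.out.printf("cosine = %.4f%n", cosine(doc, query));
    }
}
```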
> This is a proposal for how we should implement these vector search
> capabilities in Solr.
> h1. Search Process Overview:
> In order to implement dense vector search, the following process is typically
> followed:
> h2. Offline:
> An encoder is built. An encoder can take in text (a query, a sentence, a
> paragraph, a document, etc.) and return a dense vector representing that
> document in a rich semantic space. The semantic space is learned from
> training on textual data (usually, though other sources work, too), typically
> from the domain of the search engine.
> h2. Document Ingestion:
> When documents are processed, they are passed to the encoder, and the dense
> vector(s) returned are stored as fields on the document. There could be one
> or more vectors per-document, as the granularity of the vectors could be
> per-document, per field, per paragraph, per-sentence, or even per phrase or
> per term.
> h2. Query Time:
> *Encoding:* The query is translated to a dense vector by passing it to the
> encoder
> *Quantization:* The query is quantized. Quantization is the process of taking
> a vector with many values and turning it into “terms” in a vector space that
> approximates the full vector space of the dense vectors.
> *ANN Matching:* A query on the quantized vector tokens is executed as an ANN
> (approximate nearest neighbor) search. This allows finding most of the best
> matching documents (typically up to 95%) with a traditional and efficient
> lookup against the inverted index.
> _(optional)_ *ANN Ranking*: ranking may be performed based upon the matched
> quantized tokens to get a rough, initial ranking of documents based upon the
> similarity of the query and document vectors. This allows the next step
> (re-ranking) to be performed on a smaller subset of documents.
> *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is
> performed, a similarity calculation (cosine, dot-product, or any number of
> other calculations) is typically performed between the full (non-quantized)
> dense vectors for the query and those in the document. This re-ranking will
> typically be on the top-N results for performance reasons.
> *Return Results:* As with any search, the final step is typically to return
> the results in relevance-ranked order. In this case, that would be sorted by
> the re-ranking similarity score (i.e. “cosine descending”).
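The quantization step in this pipeline could be sketched, for example, as sign-based random projections (a hedged illustration only, not the proposed implementation; the class name and the "i_bit" token format are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of quantization: project the dense vector onto k
// random hyperplanes and emit one "term" per hyperplane sign, so the ANN
// match can run as a normal inverted-index lookup over those terms.
public class SignQuantizer {
    private final double[][] hyperplanes;  // k x dim

    public SignQuantizer(int k, int dim, long seed) {
        Random rnd = new Random(seed);  // fixed seed: index and query sides must agree
        hyperplanes = new double[k][dim];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < dim; j++) {
                hyperplanes[i][j] = rnd.nextGaussian();
            }
        }
    }

    // Each token encodes which side of hyperplane i the vector falls on,
    // e.g. "3_1" or "3_0"; similar vectors share most of their tokens.
    public List<String> quantize(double[] v) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int j = 0; j < v.length; j++) {
                dot += hyperplanes[i][j] * v[j];
            }
            tokens.add(i + "_" + (dot >= 0 ? "1" : "0"));
        }
        return tokens;
    }
}
```

Matching on some minimum number of shared tokens then approximates "nearby in the vector space" using only ordinary term lookups.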
> ------
> *Variant:* For small document sets, it may be preferable to rank all
> documents and skip steps 2, 3, and 4. This is because ANN Matching
> typically reduces recall (current state of the art is around 95% recall), so
> it can be beneficial to rank all documents if performance is not a concern.
> In this case, step 5 is performed on the full doc set and would obviously
> just be considered “ranking” instead of “re-ranking”.
> h1. Proposed Implementation in Solr:
> h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
> h3. Dense Vector Field:
> We will add a new dense vector field type in Solr. This field type would be a
> compressed encoding of a dense vector into a BinaryDocValues Field. There are
> other ways to do it, but this is almost certain to be the most efficient.
> Ideally this field is multi-valued. If it is single-valued then we are
> either limited to only document-level vectors, or otherwise we have to create
> many vector fields (i.e. per paragraph or term) and search across them, which
> will never be practical or scale well. BinaryDocValues does not natively
> support multiple values, but since the Solr field type will need to be
> responsible for encoding the numeric vectors as a byte[] it should also be
> able to encode multiple vectors at the same time.
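The multi-vector byte[] encoding described above might look like the following (a hypothetical sketch; the codec layout, class, and method names are assumptions, not a committed design):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of packing multiple float vectors into one byte[]
// for a BinaryDocValues field: a vector count, then per-vector length
// prefixes, so a decoder can stop after the 'first' vector when that is
// all the distance function needs.
public class VectorCodec {

    public static byte[] encode(float[][] vectors) {
        int size = 4;  // vector count
        for (float[] v : vectors) {
            size += 4 + 4 * v.length;  // length prefix + float values
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(vectors.length);
        for (float[] v : vectors) {
            buf.putInt(v.length);
            for (float f : v) {
                buf.putFloat(f);
            }
        }
        return buf.array();
    }

    // Decode at most 'max' vectors, leaving the remainder untouched.
    public static float[][] decode(byte[] bytes, int max) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int count = Math.min(buf.getInt(), max);
        float[][] vectors = new float[count][];
        for (int i = 0; i < count; i++) {
            float[] v = new float[buf.getInt()];
            for (int j = 0; j < v.length; j++) {
                v[j] = buf.getFloat();
            }
            vectors[i] = v;
        }
        return vectors;
    }
}
```

A real implementation would likely add compression (e.g. reduced-precision floats) on top of this layout; the sketch only shows the multi-value framing.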
> h3. Vector Similarity Value Sources:
> Implement function queries (value sources) that take in a query vector, a
> (multi-valued) vector field name, and a multi-value selector ("max", "min",
> "avg", "first"), and return the computed calculation. This function could
> then be used to score documents via the “func” query parser. ANN queries
> could be implemented using separate “fq” params for filtering, or as part of
> the “q” query parameter to support optional ANN Ranking, and then the
> existing Solr re-ranking functionality could be used for re-ranking the top N
> results using the “func” query parser and these new cosine or dot product
> value sources as the re-rank query. Existing Solr logic for performing
> two-phase iteration will ensure that these vector functions will not be
> excessively computed for documents that do not already match the “cheaper”
> ANN queries.
> *Example*:
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first){noformat}
> ^ simple, no-ANN. Third param is optional and is the selector function
> (min|max|avg|first|last). I think this should default to "max" (or whatever
> "most related" means for each similarity function; it's "max" for cosine and
> dot product).
> *Example:*
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field,
> first)&fq=[ANN_QUERY]{noformat}
> ^ ensures function is not computed unless ANN query matches due to two-phase
> iterator
> *Example:*
> {noformat}
> q=[ANN_QUERY]&rq={!rerank reRankQuery=$rqq reRankDocs=10000
> reRankWeight=1000000}&rqq={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33",
> vectors_field){noformat}
> ^Supports an initial ranking based upon the ANN QUERY, followed by a
> re-ranking based upon the cosine function.
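The multi-value selector semantics used in the examples above can be sketched as follows (a minimal sketch; the class name is illustrative, and it assumes the per-vector similarity scores for a document have already been computed):

```java
// Hypothetical sketch of the multi-value selector: given the per-document
// similarity scores (one per stored vector), reduce them to a single score
// according to the selector named in the function query.
public class MultiValueSelector {

    public static double select(double[] scores, String selector) {
        double result;
        switch (selector) {
            case "first":
                result = scores[0];
                break;
            case "min":
                result = scores[0];
                for (double s : scores) result = Math.min(result, s);
                break;
            case "avg":
                result = 0;
                for (double s : scores) result += s;
                result /= scores.length;
                break;
            case "max":  // proposed default: "most related" for cosine/dot product
            default:
                result = scores[0];
                for (double s : scores) result = Math.max(result, s);
                break;
        }
        return result;
    }
}
```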
> The key implementation challenges are expected to be:
> * ensuring decent query (speed) performance
> * Determining an efficient vector/byte encoding that has decent
> disk-space/query time decoding trade offs and makes it possible to decode
> only the ‘first’ vector per doc in cases where that is all that’s requested
> by the distance function
> h2. Phase 2: Pluggable ANN Search
> * Implement pluggable quantization/ANN search. There are multiple different
> algorithms to do quantization for efficient ANN search. The simplest is
> something like LSH, with more advanced implementations being worked on in
> Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a
> prototyped LSH implementation (Solr Vector Search plugin) and a random
> projections approach (Hangry) linked to in SOLR-12890.
> * Given that the transformers/encoders are a rapidly evolving area of data
> science and NLP research, it feels unreasonable to assume that Solr will be
> able to keep up to date on the latest model integrations. Given that, if we
> build support into Solr for encoder models, we need to make it very pluggable.
> * Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus
> for external systems to accomplish quantization by creating quantized terms
> externally, indexing them to Solr, and then sending in queries to Solr with
> the quantized terms. I’m not convinced that keeping quantization outside of
> Lucene/Solr isn’t a more manageable model long term, but if certain models do
> get integrated natively into Lucene (as is currently being worked on) then we
> should obviously look to leverage them.
> h2. Important Considerations:
> * Multi-valued dense vectors are a key differentiating feature here. I’m
> considering this a key requirement and designing accordingly. This will allow
> for multiple embeddings per document (i.e. word embeddings, sentence
> embeddings, paragraph embeddings, etc.) to be modeled. I like the way
> [~ehatcher] designed the Payload Function Query and think we should model
> the multi-valued handling similarly.
> * As best as possible, it would be nice to support multiple (pluggable)
> fields/value sources that can be converted to dense vectors and compared. At
> a minimum we need an efficient dense vector field, but it may be beneficial
> to support a sparse vector field where only meaningful values need to be
> encoded, in order to save space. Additionally, there may be some utility in
> other field types or arbitrary value sources that can produce vector-like
> output to be leveraged in the function queries being proposed.
> h1. Note on Prior Approaches
> * There are a handful of ways to perform vector search today with Solr. One
> is through Streaming Expressions, and alternatively there are a few plugins
> which have implemented this functionality for Solr. One of these was
> contributed in SOLR-12890, and I've outlined examples of most of these
> approaches in the comments on SOLR-12890.
> * The Streaming Expressions approach works well out of the box, but doesn't
> integrate nicely with traditional keyword search use cases, so we need a
> solution that integrates with real-time queries and enables the full suite of
> Solr's query-time features.
> * The most advanced Solr plugin out there seems to be this one:
> https://github.com/moshebla/solr-vector-scoring (forked the version in
> SOLR-12890 and then added LSH functionality for ANN).
> * I encountered several challenges when working with the implementation in
> SOLR-12890 which informed the above proposed design:
> 1. The plugin encodes dense vectors into payloads attached to terms
> representing vector positions. This is pretty inefficient and is way too slow
> at significant scale. I believe we instead need the underlying field types to
> be binary doc values field with efficient compression and decoding for
> optimal performance when scanning through all the matching documents and
> dimensions.
> 2. The plugin only supports one vector per-field, making supporting
> word/phrase vectors, sentence vectors, paragraph vectors, etc. challenging.
> It's not obvious how to modify the current approach to support multiple
> vectors per field.
> 3. The plugin uses a query parser instead of just a function query, which
> prevents re-use of the vector similarity function. Using a function query
> instead will enable the vector similarity scores to be easily leveraged for
> relevance scoring, sorting, returned as fields, or used within other function
> queries. Given that there are two goals of 1) filtering on ANN, and 2)
> Scoring on vector similarity (usually via re-ranking), using a function query
> for the scoring piece will be more flexible across scoring use cases
> (combining with other functions, returning, sorting, etc.)
> 4. The LSH quantization/ANN implementation is a cool idea, but it is
> hard-coded in the implementation. I recommend making it a separate filter and
> making the implementation more pluggable (though for now this can be
> externalized and passed in to Solr). We may be able to pull the LSH work in
> during phase 2.
> For cleaner discussion and tracking, I'm moving this revised proposal to this
> new JIRA, and we can use SOLR-12890 for continued discussion (if any) of the
> previously contributed plugin implementation.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)