You might consider using a TermInSetQuery in place of a BooleanQuery for the hashes (since they are all in the same field).
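For reference, a minimal sketch of the two constructions (the field name and hash bytes are placeholders, not the plugin's real values). One caveat: TermInSetQuery produces a single constant score for any matching doc, so it only fits if you don't rely on the per-clause score summing that the SHOULD clauses currently give you:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class HashQueryExample {

    // Current approach: one SHOULD clause per hash term. Each clause
    // contributes its own DISI to the disjunction's priority queue.
    static Query booleanOfTerms(String field, List<byte[]> hashes) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (byte[] h : hashes) {
            b.add(new BooleanClause(
                new ConstantScoreQuery(new TermQuery(new Term(field, new BytesRef(h)))),
                BooleanClause.Occur.SHOULD));
        }
        return b.build();
    }

    // Alternative: a single query over one field. Scores are constant
    // (matching-term counts are lost), but there is no per-term DISI heap.
    static Query termInSet(String field, List<byte[]> hashes) {
        List<BytesRef> terms = new ArrayList<>();
        for (byte[] h : hashes) terms.add(new BytesRef(h));
        return new TermInSetQuery(field, terms);
    }

    public static void main(String[] args) {
        List<byte[]> hashes = List.of("h1".getBytes(), "h2".getBytes());
        System.out.println(booleanOfTerms("hashes", hashes));
        System.out.println(termInSet("hashes", hashes));
    }
}
```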
I don't really understand why you are seeing so much cost in the heap. It
sounds as if you have a single heap with mixed scores: those generated by the
BooleanQuery and those generated by the vector scoring operation. Maybe you
can comment a little more on the interaction there? Are there really two
heaps? Do you override the standard collector?

On Tue, Jun 23, 2020 at 9:51 AM Alex K <aklib...@gmail.com> wrote:
>
> Hello all,
>
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
> I'd like to get some feedback about my usage of BooleanQueries and
> TermQueries, and see if there are any optimizations or performance tricks
> for my use case.
>
> An example use case for the plugin is reverse image search. A user can
> store vectors representing images and run a nearest-neighbors query to
> retrieve the 10 vectors with the smallest L2 distance to a query vector.
> More detailed documentation here: http://elastiknn.klibisz.com/
>
> The main method for indexing the vectors is based on Locality Sensitive
> Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> The general pattern is:
>
> 1. When indexing a vector, apply a hash function to it, producing a set
>    of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
>    Similar vectors are more likely to share hashes (i.e., similar vectors
>    produce hash collisions).
> 2. Convert each hash to a byte array and store the byte array as a
>    Lucene Term at a specific field.
> 3. Store the complete vector (i.e., floating point numbers) in a binary
>    doc values field.
>
> In other words, I'm converting each vector into a bag of words, though the
> words have no semantic meaning.
>
> A query works as follows:
>
> 1. Given a query vector, apply the same hash function to produce a set
>    of hashes.
> 2. Convert each hash to a byte array and create a Term.
> 3. Build and run a BooleanQuery with a clause for each Term. Each clause
>    looks like this: `new BooleanClause(new ConstantScoreQuery(new
>    TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
>    BooleanClause.Occur.SHOULD)`.
> 4. As the BooleanQuery produces results, maintain a fixed-size heap of
>    its scores. For any score exceeding the min in the heap, load its
>    vector from the binary doc values, compute the exact similarity, and
>    update the heap. Otherwise the vector gets a score of 0.
>
> When profiling my benchmarks with VisualVM, I've found the Elasticsearch
> search threads spend > 50% of the runtime in these two methods:
>
> - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
> - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
>   runtime)
>
> So the time seems to be dominated by collecting and ordering the results
> produced by the BooleanQuery from step 3 above.
> The exact similarity computation is only about 15% of the runtime. If I
> disable it entirely, I still see the same bottlenecks in VisualVM.
> Reducing the number of hashes yields roughly linear scaling (i.e., 400
> hashes take ~2x longer than 200 hashes).
>
> The use case differs from text search in that there's no semantic meaning
> to the terms, their length, their ordering, their stems, etc.
> I basically just need the index to be a rudimentary HashMap, and I only
> care about the scores for the top k results.
> With that in mind, I've made the following optimizations:
>
> - Disabled tokenization on the FieldType (setTokenized(false))
> - Disabled norms on the FieldType (setOmitNorms(true))
> - Set similarity to BooleanSimilarity on the Elasticsearch
>   MappedFieldType
> - Set index options to IndexOptions.DOCS
> - Used the MoreLikeThis heuristic to pick a subset of terms. This
>   understandably only yields a speedup proportional to the number of
>   discarded terms.
>
> I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
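The top-k bookkeeping in step 4 can be sketched with a plain JDK min-heap. This is only an illustration of the described logic (names and the score array are made up, not the plugin's code):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TopKSketch {

    // Keep the k best scores in a bounded min-heap; only candidates that
    // beat the current minimum are admitted. In the plugin, the exact
    // similarity re-scoring would happen at the admission point.
    static double[] topK(double[] approxScores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double s : approxScores) {
            if (heap.size() < k) {
                heap.offer(s);            // heap not yet full: always admit
            } else if (s > heap.peek()) { // beats the current minimum
                heap.poll();              // evict the smallest
                heap.offer(s);            // exact re-scoring would go here
            }
            // otherwise: candidate is ignored (score 0 in the plugin)
        }
        double[] out = new double[heap.size()];
        for (int i = heap.size() - 1; i >= 0; i--) {
            out[i] = heap.poll();         // drain min-first -> descending order
        }
        return out;
    }

    public static void main(String[] args) {
        double[] top = topK(new double[]{0.1, 0.9, 0.4, 0.7, 0.2}, 3);
        System.out.println(Arrays.toString(top)); // prints [0.9, 0.7, 0.4]
    }
}
```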
> The main query implementation is here
> <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>.
> The actual query that gets executed by Elasticsearch is instantiated on
> line 98
> <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>.
> It's in Scala, but all of the Java query classes should look familiar.
>
> Maybe there are some settings that I'm not aware of?
> Maybe I could optimize this by implementing a custom query or scorer?
> Maybe there's just no way to speed this up?
>
> I appreciate any input, examples, links, etc. :)
> Also, let me know if I can provide any additional details.
>
> Thanks,
> Alex Klibisz
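For anyone following along, the hash-then-index pattern from steps 1-3 above can be sketched with random-hyperplane hashing. That is one common LSH family, used here purely as a stand-in; the table and bit parameters, names, and string-encoded hashes are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LshSketch {

    // Hash a vector into a set of discrete terms so that similar vectors
    // are likely to collide. Each "table" holds a few random hyperplanes;
    // the table's hash is the pattern of sign bits of the dot products.
    static List<String> hashes(float[] v, float[][][] tables) {
        List<String> out = new ArrayList<>();
        for (int t = 0; t < tables.length; t++) {
            int bits = 0;
            for (int p = 0; p < tables[t].length; p++) {
                float dot = 0f;
                for (int j = 0; j < v.length; j++) dot += v[j] * tables[t][p][j];
                if (dot >= 0) bits |= (1 << p);
            }
            // One term per table; in the plugin each hash becomes a byte
            // array stored as a Lucene Term in a dedicated field.
            out.add(t + ":" + bits);
        }
        return out;
    }

    // Random Gaussian hyperplanes, seeded for reproducibility.
    static float[][][] randomTables(int numTables, int bitsPerTable, int dim, long seed) {
        Random rng = new Random(seed);
        float[][][] tables = new float[numTables][bitsPerTable][dim];
        for (float[][] table : tables)
            for (float[] plane : table)
                for (int j = 0; j < dim; j++) plane[j] = (float) rng.nextGaussian();
        return tables;
    }

    public static void main(String[] args) {
        float[][][] tables = randomTables(8, 4, 16, 42L);
        Random r = new Random(7);
        float[] v = new float[16];
        for (int j = 0; j < v.length; j++) v[j] = (float) r.nextGaussian();
        System.out.println(hashes(v, tables)); // 8 terms, one per table
    }
}
```

Identical vectors always produce identical term sets, and nearby vectors agree on most sign bits, which is what makes the "bag of words with no semantic meaning" indexing scheme work.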