You might consider using a TermInSetQuery in place of a BooleanQuery for the hashes (since they are all in the same field).
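For reference, a minimal sketch of the two constructions (the field name and hash bytes are placeholders, not the plugin's real values). One caveat: TermInSetQuery produces a single constant score for any matching doc, so it only fits if you don't rely on the per-clause score summing that the SHOULD clauses currently give you:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

public class HashQueryExample {

    // Current approach: one SHOULD clause per hash term. Each clause
    // contributes its own DISI to the disjunction's priority queue.
    static Query booleanOfTerms(String field, List<byte[]> hashes) {
        BooleanQuery.Builder b = new BooleanQuery.Builder();
        for (byte[] h : hashes) {
            b.add(new BooleanClause(
                new ConstantScoreQuery(new TermQuery(new Term(field, new BytesRef(h)))),
                BooleanClause.Occur.SHOULD));
        }
        return b.build();
    }

    // Alternative: a single query over one field. Scores are constant
    // (matching-term counts are lost), but there is no per-term DISI heap.
    static Query termInSet(String field, List<byte[]> hashes) {
        List<BytesRef> terms = new ArrayList<>();
        for (byte[] h : hashes) terms.add(new BytesRef(h));
        return new TermInSetQuery(field, terms);
    }

    public static void main(String[] args) {
        List<byte[]> hashes = List.of("h1".getBytes(), "h2".getBytes());
        System.out.println(booleanOfTerms("hashes", hashes));
        System.out.println(termInSet("hashes", hashes));
    }
}
```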
I don't really understand why you are seeing so much cost in the heap. It
sounds as if you have a single heap with mixed scores: those generated by the
BooleanQuery and those generated by the vector scoring operation. Maybe you
can comment a little more on the interaction there? Are there really two
heaps? Do you override the standard collector?

On Tue, Jun 23, 2020 at 9:51 AM Alex K <aklib...@gmail.com> wrote:
>
> Hello all,
>
> I'm working on an Elasticsearch plugin (using Lucene internally) that
> allows users to index numerical vectors and run exact and approximate
> k-nearest-neighbors similarity queries.
> I'd like to get some feedback about my usage of BooleanQueries and
> TermQueries, and see if there are any optimizations or performance tricks
> for my use case.
>
> An example use case for the plugin is reverse image search. A user can
> store vectors representing images and run a nearest-neighbors query to
> retrieve the 10 vectors with the smallest L2 distance to a query vector.
> More detailed documentation here: http://elastiknn.klibisz.com/
>
> The main method for indexing the vectors is based on Locality Sensitive
> Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> The general pattern is:
>
> 1. When indexing a vector, apply a hash function to it, producing a set
>    of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
>    Similar vectors are more likely to share hashes (i.e., similar vectors
>    produce hash collisions).
> 2. Convert each hash to a byte array and store the byte array as a
>    Lucene Term at a specific field.
> 3. Store the complete vector (i.e., floating point numbers) in a binary
>    doc values field.
>
> In other words, I'm converting each vector into a bag of words, though the
> words have no semantic meaning.
>
> A query works as follows:
>
> 1. Given a query vector, apply the same hash function to produce a set
>    of hashes.
> 2. Convert each hash to a byte array and create a Term.
> 3. Build and run a BooleanQuery with a clause for each Term. Each clause
>    looks like this: `new BooleanClause(new ConstantScoreQuery(new
>    TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
>    BooleanClause.Occur.SHOULD)`.
> 4. As the BooleanQuery produces results, maintain a fixed-size heap of
>    its scores. For any score exceeding the min in the heap, load its
>    vector from the binary doc values, compute the exact similarity, and
>    update the heap. Otherwise the vector gets a score of 0.
>
> When profiling my benchmarks with VisualVM, I've found the Elasticsearch
> search threads spend > 50% of the runtime in these two methods:
>
> - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
> - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
>   runtime)
>
> So the time seems to be dominated by collecting and ordering the results
> produced by the BooleanQuery from step 3 above.
> The exact similarity computation is only about 15% of the runtime. If I
> disable it entirely, I still see the same bottlenecks in VisualVM.
> Reducing the number of hashes yields roughly linear scaling (i.e., 400
> hashes take ~2x longer than 200 hashes).
>
> The use case differs from text search in that there's no semantic meaning
> to the terms, their length, their ordering, their stems, etc.
> I basically just need the index to be a rudimentary HashMap, and I only
> care about the scores for the top k results.
> With that in mind, I've made the following optimizations:
>
> - Disabled tokenization on the FieldType (setTokenized(false))
> - Disabled norms on the FieldType (setOmitNorms(true))
> - Set similarity to BooleanSimilarity on the Elasticsearch
>   MappedFieldType
> - Set index options to IndexOptions.DOCS
> - Used the MoreLikeThis heuristic to pick a subset of terms. This
>   understandably only yields a speedup proportional to the number of
>   discarded terms.
>
> I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
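The top-k bookkeeping in step 4 can be sketched with a plain JDK min-heap. This is only an illustration of the described logic (names and the score array are made up, not the plugin's code):

```java
import java.util.Arrays;
import java.util.PriorityQueue;

public class TopKSketch {

    // Keep the k best scores in a bounded min-heap; only candidates that
    // beat the current minimum are admitted. In the plugin, the exact
    // similarity re-scoring would happen at the admission point.
    static double[] topK(double[] approxScores, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double s : approxScores) {
            if (heap.size() < k) {
                heap.offer(s);            // heap not yet full: always admit
            } else if (s > heap.peek()) { // beats the current minimum
                heap.poll();              // evict the smallest
                heap.offer(s);            // exact re-scoring would go here
            }
            // otherwise: candidate is ignored (score 0 in the plugin)
        }
        double[] out = new double[heap.size()];
        for (int i = heap.size() - 1; i >= 0; i--) {
            out[i] = heap.poll();         // drain min-first -> descending order
        }
        return out;
    }

    public static void main(String[] args) {
        double[] top = topK(new double[]{0.1, 0.9, 0.4, 0.7, 0.2}, 3);
        System.out.println(Arrays.toString(top)); // prints [0.9, 0.7, 0.4]
    }
}
```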
> The main query implementation is here
> <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>.
> The actual query that gets executed by Elasticsearch is instantiated on
> line 98
> <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>.
> It's in Scala, but all of the Java query classes should look familiar.
>
> Maybe there are some settings that I'm not aware of?
> Maybe I could optimize this by implementing a custom query or scorer?
> Maybe there's just no way to speed this up?
>
> I appreciate any input, examples, links, etc. :)
> Also, let me know if I can provide any additional details.
>
> Thanks,
> Alex Klibisz
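For anyone following along, the hash-then-index pattern from steps 1-3 above can be sketched with random-hyperplane hashing. That is one common LSH family, used here purely as a stand-in; the table and bit parameters, names, and string-encoded hashes are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LshSketch {

    // Hash a vector into a set of discrete terms so that similar vectors
    // are likely to collide. Each "table" holds a few random hyperplanes;
    // the table's hash is the pattern of sign bits of the dot products.
    static List<String> hashes(float[] v, float[][][] tables) {
        List<String> out = new ArrayList<>();
        for (int t = 0; t < tables.length; t++) {
            int bits = 0;
            for (int p = 0; p < tables[t].length; p++) {
                float dot = 0f;
                for (int j = 0; j < v.length; j++) dot += v[j] * tables[t][p][j];
                if (dot >= 0) bits |= (1 << p);
            }
            // One term per table; in the plugin each hash becomes a byte
            // array stored as a Lucene Term in a dedicated field.
            out.add(t + ":" + bits);
        }
        return out;
    }

    // Random Gaussian hyperplanes, seeded for reproducibility.
    static float[][][] randomTables(int numTables, int bitsPerTable, int dim, long seed) {
        Random rng = new Random(seed);
        float[][][] tables = new float[numTables][bitsPerTable][dim];
        for (float[][] table : tables)
            for (float[] plane : table)
                for (int j = 0; j < dim; j++) plane[j] = (float) rng.nextGaussian();
        return tables;
    }

    public static void main(String[] args) {
        float[][][] tables = randomTables(8, 4, 16, 42L);
        Random r = new Random(7);
        float[] v = new float[16];
        for (int j = 0; j < v.length; j++) v[j] = (float) r.nextGaussian();
        System.out.println(hashes(v, tables)); // 8 terms, one per table
    }
}
```

Identical vectors always produce identical term sets, and nearby vectors agree on most sign bits, which is what makes the "bag of words with no semantic meaning" indexing scheme work.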