Hello all, I'm working on an Elasticsearch plugin (using Lucene internally) that allows users to index numerical vectors and run exact and approximate k-nearest-neighbors similarity queries. I'd like to get some feedback about my usage of BooleanQueries and TermQueries, and see if there are any optimizations or performance tricks for my use case.
An example use case for the plugin is reverse image search: a user stores vectors representing images and runs a nearest-neighbors query to retrieve the 10 vectors with the smallest L2 distance to a query vector. More detailed documentation is here: http://elastiknn.klibisz.com/

The main method for indexing the vectors is based on Locality Sensitive Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. The general pattern is:

1. When indexing a vector, apply a hash function to it, producing a set of discrete hashes, usually anywhere from 100 to 1000 of them. Similar vectors are more likely to share hashes (i.e., similar vectors produce hash collisions).
2. Convert each hash to a byte array and store the byte array as a Lucene Term at a specific field.
3. Store the complete vector (i.e., the floating-point numbers) in a binary doc values field.

In other words, I'm converting each vector into a bag of words, though the words have no semantic meaning.

A query works as follows:

1. Given a query vector, apply the same hash function to produce a set of hashes.
2. Convert each hash to a byte array and create a Term.
3. Build and run a BooleanQuery with a clause for each Term. Each clause looks like this: `new BooleanClause(new ConstantScoreQuery(new TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))), BooleanClause.Occur.SHOULD)`.
4. As the BooleanQuery produces results, maintain a fixed-size heap of its scores. For any score exceeding the min in the heap, load the corresponding vector from the binary doc values, compute the exact similarity, and update the heap. Any other vector gets a score of 0.
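Stripped of the Lucene machinery, the pattern above can be sketched in plain Java. Everything here is a hypothetical stand-in (the plugin's actual hash functions and scoring differ); the sketch only illustrates that similar vectors collide on discrete hashes (step 1 of indexing) and that a fixed-size min-heap keeps the top-k exact scores (step 4 of querying):

```java
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.Set;

public class LshSketch {

    // Toy version of indexing step 1: hash a vector into a bag of discrete
    // hashes. Each hash is the sign pattern of the vector against one band of
    // random hyperplanes; similar vectors produce mostly identical patterns.
    static Set<String> hashVector(float[] v, float[][][] bands) {
        Set<String> hashes = new HashSet<>();
        for (int b = 0; b < bands.length; b++) {
            StringBuilder sb = new StringBuilder().append(b).append(':');
            for (float[] plane : bands[b]) {
                float dot = 0f;
                for (int i = 0; i < v.length; i++) dot += v[i] * plane[i];
                sb.append(dot >= 0 ? '1' : '0');
            }
            hashes.add(sb.toString()); // in the plugin, this would become a Term's bytes
        }
        return hashes;
    }

    // Seeded random hyperplanes: `bands` bands of `bits` hyperplanes each.
    static float[][][] randomBands(int bands, int bits, int dims, long seed) {
        Random rng = new Random(seed);
        float[][][] out = new float[bands][bits][dims];
        for (float[][] band : out)
            for (float[] plane : band)
                for (int i = 0; i < dims; i++) plane[i] = (float) rng.nextGaussian();
        return out;
    }

    // Toy version of query step 4: a fixed-size min-heap so only the top k
    // exact scores survive. Here "exact similarity" is 1 / (1 + L2 distance).
    static double[] topK(float[] query, List<float[]> candidates, int k) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (float[] v : candidates) {
            double sumSq = 0;
            for (int i = 0; i < query.length; i++) {
                double d = query[i] - v[i];
                sumSq += d * d;
            }
            double score = 1.0 / (1.0 + Math.sqrt(sumSq));
            if (heap.size() < k) heap.add(score);
            else if (score > heap.peek()) { heap.poll(); heap.add(score); }
        }
        // Polling yields ascending scores; fill the array back-to-front
        // so the result is sorted descending (best first).
        double[] best = new double[heap.size()];
        for (int i = best.length - 1; i >= 0; i--) best[i] = heap.poll();
        return best;
    }

    public static void main(String[] args) {
        List<float[]> corpus = java.util.Arrays.asList(
                new float[]{0f, 0f}, new float[]{3f, 4f}, new float[]{6f, 8f});
        // Prints the two highest exact scores, descending.
        System.out.println(java.util.Arrays.toString(topK(new float[]{0f, 0f}, corpus, 2)));
    }
}
```

The real bottleneck is of course inside Lucene's disjunction scoring, which this sketch doesn't model; it only mirrors the bag-of-hashes and top-k bookkeeping described above.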
When profiling my benchmarks with VisualVM, I've found the Elasticsearch search threads spend > 50% of the runtime in these two methods:

- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of runtime)

So the time seems to be dominated by collecting and ordering the results produced by the BooleanQuery from step 3 above. The exact similarity computation is only about 15% of the runtime; if I disable it entirely, I still see the same bottlenecks in VisualVM. Reducing the number of hashes yields roughly linear scaling (i.e., 400 hashes take ~2x as long as 200 hashes).

The use case differs from text search in that the terms carry no semantic meaning, and neither do their length, ordering, stems, etc. I basically just need the index to act as a rudimentary HashMap, and I only care about the scores of the top k results. With that in mind, I've made the following optimizations:

- Disabled tokenization on the FieldType (setTokenized(false)).
- Disabled norms on the FieldType (setOmitNorms(true)).
- Set the similarity to BooleanSimilarity on the Elasticsearch MappedFieldType.
- Set the index options to IndexOptions.DOCS.
- Used the MoreLikeThis heuristic to pick a subset of terms. This understandably only yields a speedup proportional to the number of discarded terms.

I'm using Elasticsearch 7.6.2 with Lucene 8.4.0. The main query implementation is here: <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>
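For concreteness, the FieldType settings in the list above look roughly like this in the Lucene 8.x API (a sketch only; the class and method names around it are hypothetical, and the BooleanSimilarity is set separately on the Elasticsearch MappedFieldType, not on the FieldType):

```java
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class HashFieldType {
    public static FieldType build() {
        FieldType ft = new FieldType();
        ft.setTokenized(false);                 // hashes are opaque terms; no analysis
        ft.setOmitNorms(true);                  // no length normalization needed
        ft.setIndexOptions(IndexOptions.DOCS);  // only doc ids; no freqs/positions/offsets
        ft.freeze();                            // make the configuration immutable
        return ft;
    }
}
```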
The actual query that gets executed by Elasticsearch is instantiated on line 98 <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>. It's in Scala, but all of the Java query classes should look familiar.

Maybe there are some settings that I'm not aware of? Maybe I could optimize this by implementing a custom query or scorer? Maybe there's just no way to speed this up? I appreciate any input, examples, links, etc. :) Also, let me know if I can provide any additional details.

Thanks,
Alex Klibisz