Hi all,

I am working on a query that takes a set of terms, finds all documents containing at least one of those terms, selects a subset of candidate docs with the most matching terms, and applies a user-provided scoring function to each candidate doc.
Simple example of the query:

- query terms: ("aaa", "bbb")
- indexed docs: docId 0 has terms ("aaa", "bbb"); docId 1 has terms ("aaa", "ccc")
- number of top candidates = 1
- simple scoring function: score(docId) = docId + 10

The query first builds a count array [2, 1], because docId 0 contains two matching terms and docId 1 contains one matching term. It then picks docId 0 as the candidate subset and applies the scoring function, returning a score of 10 for docId 0.

The main bottleneck right now is the initial counting, i.e. the part that produces the [2, 1] array. I first tried a BooleanQuery containing a SHOULD clause for every term, so the returned score was the count. This was simple but very slow. I then got a substantial speedup by copying and modifying TermInSetQuery so that it tracks the number of times each docId contains a query term. The main construct here seems to be PrefixCodedTerms. At this point I'm not sure whether there's a faster construct, or perhaps a more optimal way to use PrefixCodedTerms?

Here is the specific query, highlighting the relevant parts of the code:

- Build the PrefixCodedTerms (in my case the terms are called 'hashes'): https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
- Count the matching terms in a segment (this is the main bottleneck in my query): https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

I appreciate any suggestions you might have.

- Alex
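For concreteness, here is a minimal self-contained sketch of the three steps (count matches, pick the top candidate, apply the scoring function), using plain Java collections instead of the Lucene index structures; the class and method names here are illustrative and not taken from the actual plugin code:

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntToDoubleFunction;

public class CountAndScore {

    // Count how many query terms each doc contains, pick the single doc with
    // the most matches, and apply the user-provided scoring function to it.
    static double run(Set<String> queryTerms, List<Set<String>> docs,
                      IntToDoubleFunction scoreFn) {
        int[] counts = new int[docs.size()];  // e.g. [2, 1] for the example above
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId)) {
                if (queryTerms.contains(term)) counts[docId]++;
            }
        }
        // Candidate subset of size 1: the doc with the highest match count.
        int best = 0;
        for (int docId = 1; docId < counts.length; docId++) {
            if (counts[docId] > counts[best]) best = docId;
        }
        return scoreFn.applyAsDouble(best);
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("aaa", "bbb"),   // docId 0
                Set.of("aaa", "ccc"));  // docId 1
        double score = run(Set.of("aaa", "bbb"), docs, docId -> docId + 10);
        System.out.println(score);  // docId 0 wins with 2 matches, so score 10.0
    }
}
```

In the real query the counting loop runs over postings per segment rather than per-doc term sets, which is exactly the part I'm trying to speed up.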