Thanks Ali. I don't think that will work in this case, since the data I'm counting is managed by lucene, but that looks like an interesting project. -Alex
On Fri, Jul 24, 2020, 00:15 Ali Akhtar <ali@ali.actor> wrote: > I'm new to lucene so I'm not sure what the best way of speeding this up in > Lucene is, but I've previously used https://github.com/npgall/cqengine for > similar stuff. It provided really good performance, especially if you're > just counting things. > > On Fri, Jul 24, 2020 at 6:55 AM Alex K <aklib...@gmail.com> wrote: > > > Hi all, > > > > I am working on a query that takes a set of terms, finds all documents > > containing at least one of those terms, computes a subset of candidate > docs > > with the most matching terms, and applies a user-provided scoring > function > > to each of the candidate docs > > > > Simple example of the query: > > - query terms ("aaa", "bbb") > > - indexed docs with terms: > > docId 0 has terms ("aaa", "bbb") > > docId 1 has terms ("aaa", "ccc") > > - number of top candidates = 1 > > - simple scoring function score(docId) = docId + 10 > > The query first builds a count array [2, 1], because docId 0 contains two > > matching terms and docId 1 contains 1 matching term. > > Then it picks docId 0 as the candidate subset. > > Then it applies the scoring function, returning a score of 10 for docId > 0. > > > > The main bottleneck right now is doing the initial counting, i.e. the > part > > that returns the [2, 1] array. > > > > I first started by using a BoolQuery containing a Should clause for every > > Term, so the returned score was the count. This was simple but very slow. > > Then I got a substantial speedup by copying and modifying the > > TermInSetQuery so that it tracks the number of times each docId contains > a > > query term. The main construct here seems to be PrefixCodedTerms. > > > > At this point I'm not sure if there's any faster construct, or perhaps a > > more optimal way to use PrefixCodedTerms? > > > > Here is the specific query, highlighting some specific parts of the code: > > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'): > > > > > https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33 > > - Count the matching terms in a segment (this is the main bottleneck in > my > > query): > > > > > https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73 > > > > I appreciate any suggestions you might have. > > > > - Alex > > >