I left a comment on the issue.

Mike McCandless
http://blog.mikemccandless.com

On Sun, Sep 20, 2020 at 1:08 PM Alex K <aklib...@gmail.com> wrote:

> Hi all, I'm still a bit stuck on this particular issue. I posted an issue on the Elastiknn repo outlining some measurements and thoughts on potential solutions: https://github.com/alexklibisz/elastiknn/issues/160
>
> To restate the question: Is there a known optimal way to find and count docs matching 10s to 100s of terms? It seems the bottleneck is in the PostingsFormat implementation. Perhaps there is a PostingsFormat better suited for this use case?
>
> Thanks,
> Alex
>
> On Fri, Jul 24, 2020 at 7:59 AM Alex K <aklib...@gmail.com> wrote:
>
> > Thanks Ali. I don't think that will work in this case, since the data I'm counting is managed by Lucene, but that looks like an interesting project.
> > -Alex
> >
> > On Fri, Jul 24, 2020, 00:15 Ali Akhtar <ali@ali.actor> wrote:
> >
> > > I'm new to Lucene, so I'm not sure what the best way of speeding this up in Lucene is, but I've previously used https://github.com/npgall/cqengine for similar stuff. It provided really good performance, especially if you're just counting things.
> > >
> > > On Fri, Jul 24, 2020 at 6:55 AM Alex K <aklib...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am working on a query that takes a set of terms, finds all documents containing at least one of those terms, computes a subset of candidate docs with the most matching terms, and applies a user-provided scoring function to each of the candidate docs.
> > > >
> > > > Simple example of the query:
> > > > - query terms: ("aaa", "bbb")
> > > > - indexed docs with terms:
> > > >     docId 0 has terms ("aaa", "bbb")
> > > >     docId 1 has terms ("aaa", "ccc")
> > > > - number of top candidates = 1
> > > > - simple scoring function: score(docId) = docId + 10
> > > >
> > > > The query first builds a count array [2, 1], because docId 0 contains two matching terms and docId 1 contains one matching term. Then it picks docId 0 as the candidate subset. Then it applies the scoring function, returning a score of 10 for docId 0.
> > > >
> > > > The main bottleneck right now is the initial counting, i.e. the part that returns the [2, 1] array.
> > > >
> > > > I first started with a BooleanQuery containing a SHOULD clause for every term, so the returned score was the count. This was simple but very slow. Then I got a substantial speedup by copying and modifying the TermInSetQuery so that it tracks the number of times each docId contains a query term. The main construct here seems to be PrefixCodedTerms.
> > > >
> > > > Here is the specific query, highlighting some specific parts of the code:
> > > > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'): https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
> > > > - Count the matching terms in a segment (this is the main bottleneck in my query): https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
> > > >
> > > > At this point I'm not sure if there's a faster construct, or perhaps a more optimal way to use PrefixCodedTerms?
> > > >
> > > > I appreciate any suggestions you might have.
> > > >
> > > > - Alex
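
For concreteness, here is a minimal sketch of the per-segment counting step that the quoted question identifies as the bottleneck. This is not the Elastiknn code itself: the method name countMatches is a placeholder, the field name is a parameter (it would be the 'hashes' field in Alex's case), and it assumes a PrefixCodedTerms built from the sorted query terms of a single field, against the Lucene 8.x APIs.

    import java.io.IOException;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.PrefixCodedTerms;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    // For one segment, count how many of the query terms occur in each doc.
    static short[] countMatches(LeafReaderContext ctx, String field,
                                PrefixCodedTerms queryTerms) throws IOException {
      short[] counts = new short[ctx.reader().maxDoc()];
      Terms terms = ctx.reader().terms(field);
      if (terms == null) return counts; // field absent from this segment
      TermsEnum termsEnum = terms.iterator();
      PostingsEnum postings = null; // reused across terms to avoid reallocation
      PrefixCodedTerms.TermIterator queryTerm = queryTerms.iterator();
      for (BytesRef term = queryTerm.next(); term != null; term = queryTerm.next()) {
        if (termsEnum.seekExact(term)) {
          // PostingsEnum.NONE: we only need doc ids, not freqs or positions.
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          for (int docId = postings.nextDoc();
               docId != DocIdSetIterator.NO_MORE_DOCS;
               docId = postings.nextDoc()) {
            counts[docId]++; // one more query term matched this doc
          }
        }
      }
      return counts;
    }

The work here is dominated by the seekExact and nextDoc calls, which go through the segment's PostingsFormat, consistent with the observation in the thread that the PostingsFormat implementation is where the time goes.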
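And a sketch of the follow-on step from the worked example (count array [2, 1], one candidate, score(docId) = docId + 10): keep the numCandidates docs with the highest counts, then apply the user-provided scoring function only to those. The helper name and the use of java.util.PriorityQueue are illustrative choices, not the actual Elastiknn implementation.

    import java.util.PriorityQueue;
    import java.util.function.IntToDoubleFunction;

    // Select the numCandidates docs with the highest match counts, then
    // score only those with the user-provided function.
    static void scoreTopCandidates(short[] counts, int numCandidates,
                                   IntToDoubleFunction scoreFunction) {
      // Min-heap ordered by count: the root is the weakest candidate seen
      // so far and is evicted cheaply when a better doc arrives.
      PriorityQueue<Integer> heap =
          new PriorityQueue<>((a, b) -> Short.compare(counts[a], counts[b]));
      for (int docId = 0; docId < counts.length; docId++) {
        if (counts[docId] == 0) continue; // matched no query terms
        if (heap.size() < numCandidates) {
          heap.add(docId);
        } else if (counts[heap.peek()] < counts[docId]) {
          heap.poll();
          heap.add(docId);
        }
      }
      for (int docId : heap) {
        System.out.println("docId=" + docId + " score=" + scoreFunction.applyAsDouble(docId));
      }
    }

With counts = {2, 1}, numCandidates = 1, and scoreFunction = docId -> docId + 10, this prints "docId=0 score=10.0", matching the worked example in the quoted message.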