Re: Optimizing term-occurrence counting (code included)

Ali Akhtar Thu, 23 Jul 2020 21:09:17 -0700

I'm new to lucene so I'm not sure what the best way of speeding this up in
Lucene is, but I've previously used https://github.com/npgall/cqengine for
similar stuff. It provided really good performance, especially if you're
just counting things.


On Fri, Jul 24, 2020 at 6:55 AM Alex K <[email protected]> wrote:

> Hi all,
>
> I am working on a query that takes a set of terms, finds all documents
> containing at least one of those terms, computes a subset of candidate docs
> with the most matching terms, and applies a user-provided scoring function
> to each of the candidate docs
>
> Simple example of the query:
> - query terms ("aaa", "bbb")
> - indexed docs with terms:
>   docId 0 has terms ("aaa", "bbb")
>   docId 1 has terms ("aaa", "ccc")
> - number of top candidates = 1
> - simple scoring function score(docId) = docId + 10
> The query first builds a count array [2, 1], because docId 0 contains two
> matching terms and docId 1 contains 1 matching term.
> Then it picks docId 0 as the candidate subset.
> Then it applies the scoring function, returning a score of 10 for docId 0.
>
> The main bottleneck right now is doing the initial counting, i.e. the part
> that returns the [2, 1] array.
>
> I first started by using a BoolQuery containing a Should clause for every
> Term, so the returned score was the count. This was simple but very slow.
> Then I got a substantial speedup by copying and modifying the
> TermInSetQuery so that it tracks the number of times each docId contains a
> query term. The main construct here seems to be PrefixCodedTerms.
>
> At this point I'm not sure if there's any faster construct, or perhaps a
> more optimal way to use PrefixCodedTerms?
>
> Here is the specific query, highlighting some specific parts of the code:
> - Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
>
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
> - Count the matching terms in a segment (this is the main bottleneck in my
> query):
>
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
>
> I appreciate any suggestions you might have.
>
> - Alex
>

Re: Optimizing term-occurrence counting (code included)

Reply via email to