Re: Optimizing term-occurrence counting (code included)

Alex K Sun, 20 Sep 2020 10:08:31 -0700

Hi all, I'm still a bit stuck on this particular issue.I posted an issue on
the Elastiknn repo outlining some measurements and thoughts on potential
solutions: https://github.com/alexklibisz/elastiknn/issues/160


To restate the question: Is there a known optimal way to find and count
docs matching 10s to 100s of terms? It seems the bottleneck is in the
PostingsFormat implementation. Perhaps there is a PostingsFormat better
suited for this usecase?

Thanks,
Alex

On Fri, Jul 24, 2020 at 7:59 AM Alex K <[email protected]> wrote:

> Thanks Ali. I don't think that will work in this case, since the data I'm
> counting is managed by lucene, but that looks like an interesting project.
> -Alex
>
> On Fri, Jul 24, 2020, 00:15 Ali Akhtar <[email protected]> wrote:
>
>> I'm new to lucene so I'm not sure what the best way of speeding this up in
>> Lucene is, but I've previously used https://github.com/npgall/cqengine
>> for
>> similar stuff. It provided really good performance, especially if you're
>> just counting things.
>>
>> On Fri, Jul 24, 2020 at 6:55 AM Alex K <[email protected]> wrote:
>>
>> > Hi all,
>> >
>> > I am working on a query that takes a set of terms, finds all documents
>> > containing at least one of those terms, computes a subset of candidate
>> docs
>> > with the most matching terms, and applies a user-provided scoring
>> function
>> > to each of the candidate docs
>> >
>> > Simple example of the query:
>> > - query terms ("aaa", "bbb")
>> > - indexed docs with terms:
>> >   docId 0 has terms ("aaa", "bbb")
>> >   docId 1 has terms ("aaa", "ccc")
>> > - number of top candidates = 1
>> > - simple scoring function score(docId) = docId + 10
>> > The query first builds a count array [2, 1], because docId 0 contains
>> two
>> > matching terms and docId 1 contains 1 matching term.
>> > Then it picks docId 0 as the candidate subset.
>> > Then it applies the scoring function, returning a score of 10 for docId
>> 0.
>> >
>> > The main bottleneck right now is doing the initial counting, i.e. the
>> part
>> > that returns the [2, 1] array.
>> >
>> > I first started by using a BoolQuery containing a Should clause for
>> every
>> > Term, so the returned score was the count. This was simple but very
>> slow.
>> > Then I got a substantial speedup by copying and modifying the
>> > TermInSetQuery so that it tracks the number of times each docId
>> contains a
>> > query term. The main construct here seems to be PrefixCodedTerms.
>> >
>> > At this point I'm not sure if there's any faster construct, or perhaps a
>> > more optimal way to use PrefixCodedTerms?
>> >
>> > Here is the specific query, highlighting some specific parts of the
>> code:
>> > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
>> >
>> >
>> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
>> > - Count the matching terms in a segment (this is the main bottleneck in
>> my
>> > query):
>> >
>> >
>> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
>> >
>> > I appreciate any suggestions you might have.
>> >
>> > - Alex
>> >
>>
>

Re: Optimizing term-occurrence counting (code included)

Reply via email to