Re: Relative cpu cost of fetching term frequency during scoring

Vimal Jain Wed, 21 Jun 2023 21:00:32 -0700

I profiled the new code and found that the following API call is the most
time-consuming:
org.apache.lucene.index.PostingsEnum#freq
If I comment out this call and use a random integer instead for testing
purposes, perf is at least 5x that of the old code.
Are there any thoughts on why term frequency calls on PostingsEnum are that
slow?
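
To make the setup concrete, here is a minimal sketch of how a token type could
be packed into a custom term frequency and decoded again inside score(). The
class name, the 3-bit layout, and the boost table are all assumptions made for
illustration; the actual encoding is not shown in this thread. The sketch
deliberately has no Lucene dependency: the int passed to score() stands in for
the value returned by PostingsEnum#freq.

```java
/** Hypothetical packing of a token type and a count into one term-frequency int. */
public final class PackedTermFreq {
    // Assumption: the 3 low bits hold the token type (up to 8 types).
    static final int TYPE_BITS = 3;
    static final int TYPE_MASK = (1 << TYPE_BITS) - 1;
    static final int MAX_COUNT = Integer.MAX_VALUE >>> TYPE_BITS;

    // Assumption: one illustrative boost per token type.
    static final float[] BOOST_BY_TYPE = {1.0f, 2.0f, 4.0f, 1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    /** Index time: combine type and count into the value stored as the term frequency. */
    public static int encode(int tokenType, int count) {
        if (tokenType < 0 || tokenType > TYPE_MASK) {
            throw new IllegalArgumentException("tokenType out of range: " + tokenType);
        }
        // A count of at least 1 keeps the packed value positive.
        if (count < 1 || count > MAX_COUNT) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        return (count << TYPE_BITS) | tokenType;
    }

    /** Query time: recover the token type from the packed frequency. */
    public static int tokenType(int packedFreq) {
        return packedFreq & TYPE_MASK;
    }

    /** Query time: recover the original count from the packed frequency. */
    public static int count(int packedFreq) {
        return packedFreq >>> TYPE_BITS;
    }

    /** Stand-in for the scorer's score(): boost chosen by type, scaled by count. */
    public static float score(int packedFreq) {
        return BOOST_BY_TYPE[tokenType(packedFreq)] * count(packedFreq);
    }

    public static void main(String[] args) {
        int packed = encode(2, 5);
        System.out.println("type=" + tokenType(packed)
                + " count=" + count(packed)
                + " score=" + score(packed));
    }
}
```

In a layout like this the decoding itself is only a mask and a shift, so the
profile above attributes the time to the freq() call rather than to what the
scorer does with the value afterwards.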




*Thanks and Regards,*
*Vimal Jain*


On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand <[email protected]> wrote:

> As far as your performance problem is concerned, I don't know. Can you
> compare the number of documents that need to be evaluated in both cases,
> e.g. by running `IndexSearcher#count` on your two queries. If they're
> similar, can you run your new query under a profiler to figure out what its
> bottleneck is?
>
> Regarding migration to newer major version, there is a MIGRATE.txt that
> gives some advice:
>
> https://github.com/apache/lucene/blob/releases/lucene-solr/8.0.0/lucene/MIGRATE.txt
> .
>
> On Wed, Jun 21, 2023 at 8:54 AM Vimal Jain <[email protected]> wrote:
>
> > Thanks Adrien, I had a look at your blog post. It looks like
> > Scorer#getMaxScore was added in Lucene 8.0; I am using 7.7.3.
> > A side question: is there any resource to help migrate to a newer major
> > version? I see a lot of APIs changed from v7 to v8.
> >
> > *Thanks and Regards,*
> > *Vimal Jain*
> >
> >
> > On Wed, Jun 21, 2023 at 1:08 AM Adrien Grand <[email protected]> wrote:
> >
> > > Lucene has logic to only evaluate a subset of the matching documents when
> > > retrieving top-k hits. This leverages the Scorer#getMaxScore API. If you
> > > never implemented it on your custom query, then you never took advantage
> > > of dynamic pruning anyway. I wrote a bit more about it
> > > <https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand>
> > > a few years ago if you're curious.
> > >
> > > On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain <[email protected]> wrote:
> > >
> > > > Thanks Adrien for the quick response.
> > > > Yes, I am replacing disjuncts across multiple fields with a single
> > > > custom term query over a merged field.
> > > > Can you please provide more details on what you mean by dynamic
> > > > pruning in the context of a custom term query?
> > > >
> > > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand, <[email protected]> wrote:
> > > >
> > > > > Intuitively replacing a disjunction across multiple fields with a
> > > > > single term query should always be faster.
> > > > >
> > > > > You're saying that you're storing the type of token as part of the
> > > > > term frequency. This doesn't sound like something that would play well
> > > > > with dynamic pruning, so I wonder if this is the reason why you are
> > > > > seeing slower queries. But since you mentioned custom term queries,
> > > > > maybe you never actually took advantage of dynamic pruning?
> > > > >
> > > > > On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain <[email protected]> wrote:
> > > > >
> > > > > > OK, sorry, I realized that I need to provide more context.
> > > > > > We used to create a Lucene query that consisted of custom term
> > > > > > queries for different fields, and based on the type of field we
> > > > > > assigned a boost that was used in scoring.
> > > > > > Now we want to get rid of the different fields: instead of creating
> > > > > > multiple term queries, we create only one term query for the merged
> > > > > > field, and the scorer of this term query (on the merged field) uses
> > > > > > the custom term frequency info (stored during indexing) to deduce the
> > > > > > type of token and hence the score that we were using earlier.
> > > > > > So the perf drop is observed relative to the earlier implementation
> > > > > > (with multiple term queries).
> > > > > >
> > > > > >
> > > > > > *Thanks and Regards,*
> > > > > > *Vimal Jain*
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand <[email protected]> wrote:
> > > > > >
> > > > > > > You say you observed a performance drop; what are you comparing
> > > > > > > against?
> > > > > > >
> > > > > > > On Tue, Jun 20, 2023, 08:59, Vimal Jain <[email protected]> wrote:
> > > > > > >
> > > > > > > > Note: I am using Lucene 7.7.3.
> > > > > > > >
> > > > > > > > *Thanks and Regards,*
> > > > > > > > *Vimal Jain*
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > > I want to understand whether fetching the term frequency of a
> > > > > > > > > term during scoring is a relatively CPU-bound operation.
> > > > > > > > > Context: I am storing a custom term frequency during indexing
> > > > > > > > > and later using it for scoring at query execution time (in the
> > > > > > > > > Scorer's score() method). I noticed a performance drop in my
> > > > > > > > > application and I suspect it's because of this change.
> > > > > > > > > Any insight or related articles for reference would be
> > > > > > > > > appreciated.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > *Thanks and Regards,*
> > > > > > > > > *Vimal Jain*
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Adrien
> > > > >
> > > >
> > >
> > >
> > > --
> > > Adrien
> > >
> >
>
>
> --
> Adrien
>
