Re: Relative cpu cost of fetching term frequency during scoring

Adrien Grand Mon, 26 Jun 2023 05:41:07 -0700

This is a bit surprising. Can you share the profiler output (e.g. a
screenshot), so that we can see what is slow within the `PostingsEnum#freq` call?


`PostingsEnum#freq` may need to decode a block of freqs, but I would
generally not expect it to be 5x slower than decoding doc IDs for the
same block.
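
For reference, since the thread below mentions storing the token type as
part of the term frequency: here is a minimal sketch of what such packing
might look like. The layout (low 3 bits for the type, remaining bits for the
count) and the names `PackedFreq`, `pack`, `typeOf` and `countOf` are
assumptions made up for illustration, not taken from Vimal's code or from
Lucene. Unpacking the value returned by `PostingsEnum#freq` this way is only
a mask and a shift, so it is unlikely to be where the time goes.

```java
public final class PackedFreq {
    // Hypothetical layout: the low TYPE_BITS bits hold the token type,
    // the remaining high bits hold the actual occurrence count.
    static final int TYPE_BITS = 3;
    static final int TYPE_MASK = (1 << TYPE_BITS) - 1;
    static final int MAX_COUNT = Integer.MAX_VALUE >>> TYPE_BITS;

    private PackedFreq() {}

    /** Packs a count (>= 1) and a token type into one positive int. */
    static int pack(int count, int type) {
        if (type < 0 || type > TYPE_MASK) {
            throw new IllegalArgumentException("type out of range: " + type);
        }
        if (count < 1 || count > MAX_COUNT) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        return (count << TYPE_BITS) | type;
    }

    /** Extracts the token type from a packed value. */
    static int typeOf(int packed) {
        return packed & TYPE_MASK;
    }

    /** Extracts the occurrence count from a packed value. */
    static int countOf(int packed) {
        return packed >>> TYPE_BITS;
    }

    public static void main(String[] args) {
        int packed = pack(3, 5); // count = 3, type = 5
        System.out.println("type=" + typeOf(packed) + " count=" + countOf(packed));
    }
}
```

In a scorer, `typeOf(postings.freq())` would then select the per-type boost,
assuming the index was written with the same layout.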

On Thu, Jun 22, 2023 at 6:00 AM Vimal Jain <[email protected]> wrote:
>
> I profiled the new code and found that the API call below is the most
> time consuming:
> org.apache.lucene.index.PostingsEnum#freq
> If I comment out this call and use a random integer instead for testing
> purposes, then perf is at least 5x compared to the old code.
> Are there any thoughts on why term frequency calls on PostingsEnum are that
> slow?
>
>
>
> *Thanks and Regards,*
> *Vimal Jain*
>
>
> On Wed, Jun 21, 2023 at 1:43 PM Adrien Grand <[email protected]> wrote:
>
> > As far as your performance problem is concerned, I don't know. Can you
> > compare the number of documents that need to be evaluated in both cases,
> > e.g. by running `IndexSearcher#count` on your two queries. If they're
> > similar, can you run your new query under a profiler to figure out what its
> > bottleneck is?
> >
> > Regarding migration to newer major version, there is a MIGRATE.txt that
> > gives some advice:
> >
> > https://github.com/apache/lucene/blob/releases/lucene-solr/8.0.0/lucene/MIGRATE.txt
> > .
> >
> > On Wed, Jun 21, 2023 at 8:54 AM Vimal Jain <[email protected]> wrote:
> >
> > > Thanks Adrien , I had a look at your blog post.  Looks like this
> > > Scorer#getMaxScore was added in lucene 8.0 , i am using 7.7.3.
> > > A side question , is there any resource to help migrate newer major
> > version
> > > , i see lot of api changed from v7 to v8.
> > >
> > > *Thanks and Regards,*
> > > *Vimal Jain*
> > >
> > >
> > > On Wed, Jun 21, 2023 at 1:08 AM Adrien Grand <[email protected]> wrote:
> > >
> > > > Lucene has logic to only evaluate a subset of the matching documents
> > when
> > > > retrieving top-k hits. This leverages the Scorer#getMaxScore API. If
> > you
> > > > never implemented it on your custom query, then you never took
> > advantage
> > > of
> > > > dynamic pruning anyway. I wrote a bit more about it
> > > > <
> > > >
> > >
> > https://www.elastic.co/blog/faster-retrieval-of-top-hits-in-elasticsearch-with-block-max-wand
> > > > >
> > > > a few years ago if you're curious.
> > > >
> > > > On Tue, Jun 20, 2023 at 6:58 PM Vimal Jain <[email protected]> wrote:
> > > >
> > > > > Thanks Adrien for quick response.
> > > > > Yes , i am replacing disjuncts across multiple fields with single
> > > custom
> > > > > term query over merged field.
> > > > > Can you please provide more details on what do you mean by dynamic
> > > > pruning
> > > > > in context of custom term query ?
> > > > >
> > > > > On Tue, 20 Jun, 2023, 9:45 pm Adrien Grand, <[email protected]>
> > wrote:
> > > > >
> > > > > > Intuitively replacing a disjunction across multiple fields with a
> > > > single
> > > > > > term query should always be faster.
> > > > > >
> > > > > > You're saying that you're storing the type of token as part of the
> > > term
> > > > > > frequency. This doesn't sound like something that would play well
> > > with
> > > > > > dynamic pruning, so I wonder if this is the reason why you are
> > seeing
> > > > > > slower queries. But since you mentioned custom term queries, maybe
> > > you
> > > > > > never actually took advantage of dynamic pruning?
> > > > > >
> > > > > > On Tue, Jun 20, 2023 at 10:30 AM Vimal Jain <[email protected]>
> > > wrote:
> > > > > >
> > > > > > > Ok , sorry , I realized that I need to provide more context.
> > > > > > > So we used to create a lucene query which consisted of custom
> > term
> > > > > > queries
> > > > > > > for different fields and based on the type of field , we used to
> > > > > assign a
> > > > > > > boost that would be used in scoring.
> > > > > > > Now we want to get rid off different fields and instead of
> > creating
> > > > > > > multiple term queries , we create only 1 term query for the
> > merged
> > > > > field
> > > > > > > and the scorer of this term query ( on merged field ) makes use
> > of
> > > > > custom
> > > > > > > term frequency info to deduce type of token ( during indexing we
> > > > store
> > > > > > this
> > > > > > > info ) and hence the score that we were using earlier.
> > > > > > > So perf drop is observed in reference to  earlier implementation
> > (
> > > > with
> > > > > > > multiple term queries ).
> > > > > > >
> > > > > > >
> > > > > > > *Thanks and Regards,*
> > > > > > > *Vimal Jain*
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 20, 2023 at 1:01 PM Adrien Grand <[email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > > You say you observed a performance drop, what are you comparing
> > > > > > against?
> > > > > > > >
> > > > > > > > > On Tue, Jun 20, 2023, 08:59, Vimal Jain <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Note - i am using lucene 7.7.3
> > > > > > > > >
> > > > > > > > > *Thanks and Regards,*
> > > > > > > > > *Vimal Jain*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Jun 20, 2023 at 12:26 PM Vimal Jain <
> > [email protected]>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > > I want to understand if fetching the term frequency of a
> > term
> > > > > > during
> > > > > > > > > > scoring is relatively cpu bound operation ?
> > > > > > > > > > Context - I am storing custom term frequency during
> > indexing
> > > > and
> > > > > > > later
> > > > > > > > > > using it for scoring during query execution time ( in
> > > Scorer's
> > > > > > > score()
> > > > > > > > > > method ). I noticed a performance drop in my application
> > and
> > > I
> > > > > > > suspect
> > > > > > > > > it's
> > > > > > > > > > because of this change.
> > > > > > > > > > Any insight or related articles for reference would be
> > > > > appreciated.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > *Thanks and Regards,*
> > > > > > > > > > *Vimal Jain*
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Adrien
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Adrien
> > > >
> > >
> >
> >
> > --
> > Adrien
> >



-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
