I think the usual usage pattern is to *refresh* frequently and commit
less frequently. Is there a reason you need to commit often?
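
For example, something like this (an untested sketch - SearcherManager
constructor arguments vary a bit across versions, and the intervals are
purely illustrative) refreshes every second so searches see new
documents quickly, but only pays for a durable commit every few
minutes:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

// Sketch: refresh (reopen) often so searches see recent updates, but
// commit on a much longer interval, since commit only buys durability,
// not visibility. Intervals are illustrative only.
public class RefreshVsCommit {

  public static void start(IndexWriter writer) throws Exception {
    SearcherManager manager = new SearcherManager(writer, new SearcherFactory());
    ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

    // Relatively cheap: make recent updates searchable every second.
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        manager.maybeRefresh();
      } catch (Exception e) {
        // log and carry on in real code
      }
    }, 1, 1, TimeUnit.SECONDS);

    // Expensive: fsync the index for durability only every few minutes.
    scheduler.scheduleWithFixedDelay(() -> {
      try {
        writer.commit();
      } catch (Exception e) {
        // log and carry on in real code
      }
    }, 5, 5, TimeUnit.MINUTES);
  }
}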

You may also have overlooked this newish method: MergePolicy.findFullFlushMerges

If you implement that, you can tell IndexWriter to (for example) merge
the many small segments that tend to pile up when you commit frequently
and index across multiple threads. We found this helps reduce both the
number of segments and the variability in that number. I don't know
whether that is truly the root cause of your performance problems here,
though.
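
To make that concrete, here is a rough, untested sketch of the idea
(the class name and the 10 MB cutoff are just illustrative, and note
that findFullFlushMerges only exists in Lucene releases well after
6.5, so you would have to upgrade first):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

// Wraps the real merge policy and, on a full flush (e.g. commit), asks
// IndexWriter to fold the small segments into a single merge.
public class SmallSegmentsOnCommitMergePolicy extends FilterMergePolicy {

  // Illustrative cutoff only; tune it for your index.
  private static final long SMALL_SEGMENT_BYTES = 10L * 1024 * 1024;

  public SmallSegmentsOnCommitMergePolicy() {
    super(new TieredMergePolicy());
  }

  @Override
  public MergeSpecification findFullFlushMerges(MergeTrigger mergeTrigger,
      SegmentInfos segmentInfos, MergeContext mergeContext) throws IOException {
    List<SegmentCommitInfo> small = new ArrayList<>();
    for (SegmentCommitInfo info : segmentInfos) {
      // Skip segments already claimed by a running merge.
      if (mergeContext.getMergingSegments().contains(info)) {
        continue;
      }
      if (size(info, mergeContext) < SMALL_SEGMENT_BYTES) {
        small.add(info);
      }
    }
    if (small.size() < 2) {
      return null; // nothing worth merging before the commit completes
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(small));
    return spec;
  }
}

You would install it in the usual way with
IndexWriterConfig.setMergePolicy(...).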

Regarding scoring costs - I don't think creating a dummy Weight and
Scorer will do what you expect: Scorers do the matching as well as the
scoring, so you won't get any results back if you don't have a real
Scorer.

I *think* setting needsScores() to false should disable the work done
to compute relevance scores - you can confirm by looking at the scores
that come back with your hits: are they all zero? Also, we did
something similar in our system and later re-enabled scoring, and it
did not add significant cost for us. YMMV, but are you sure the costs
you are seeing come from computing scores, rather than from the
matching work that is required regardless?
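
For reference, a bare-bones non-scoring collector against the 6.x
Collector API might look like this (the class name and what it does
with the doc ids are just for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

// Sketch: a collector that declares needsScores() == false, so
// IndexSearcher is free to skip score computation where the query
// allows it.
public class MatchingDocsCollector extends SimpleCollector {

  private final List<Integer> globalDocIds = new ArrayList<>();
  private int docBase;

  @Override
  protected void doSetNextReader(LeafReaderContext context) throws IOException {
    docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    // Only record which documents matched; no Scorer.score() call here.
    globalDocIds.add(docBase + doc);
  }

  @Override
  public boolean needsScores() {
    return false;
  }

  public List<Integer> getGlobalDocIds() {
    return globalDocIds;
  }
}

You would pass it to IndexSearcher.search(query, collector) and then
compare the CPU profile with and without scoring enabled.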

-Mike

On Fri, Aug 20, 2021 at 2:02 PM Varun Sharma
<varun.sha...@airbnb.com.invalid> wrote:
>
> Hi,
>
> We have a large index that we divide into X Lucene indices - we use Lucene
> 6.5.0. Each of our serving machines serves 8 Lucene indices in parallel.
> We are getting realtime updates to each of these 8 indices. We are seeing a
> couple of things:
>
> a) When we turn off realtime updates, performance is significantly better.
> When we turn them on, CPU utilization by Lucene goes up by at least *3X*
> [based on profiling], due to the accumulation of segments.
>
> b)  A profile shows that the vast majority of time is being spent in
> scoring methods even though we are setting *needsScores() to false* in our
> collectors.
>
> We do commit our index frequently and we are roughly at ~25 segments per
> index - so a total of 8 * 25 ~ 200 segments across all the 8 indices.
>
> Changing the number of indices per machine (currently 8) to reduce the
> number of segments would be a significant effort. So, we would like to know
> if there are ways to improve performance w.r.t. a) & b):
>
> i) We have tried some parameters with the merge policy &
> NRTCachingDirectory and they did not help significantly.
> ii) Since we don't care about Lucene-level scores, is there a way to
> completely disable scoring? Should setting needsScores() to false in our
> collectors do the trick? Or should we create our own dummy Weight/Scorer
> and inject it into the Query classes?
>
> Thanks
> Varun
