Re: Potential bug

Adrien Grand Mon, 14 Jun 2021 03:29:46 -0700

Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't actually see an insult from Alessandro.

+1 to Alessandro's call to soften the tone of this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <[email protected]>
wrote:

> Hi Baris,
> first of all, apologies for having misspelled your name; it was definitely
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we committers operate on a volunteer basis,
> contributing code and helping people in our free time, driven purely
> by passion.
> Respect is fundamental; we are not here to be treated aggressively.
>
> Regards
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, <[email protected]> wrote:
>
> > Let me guide you to a professional answer to the email below:
> >
> >
> > Hi Baris,
> >
> > Since you mentioned that you did a full performance study of your
> > application and still believe that the bottleneck is the fuzzy search
> > API from Lucene, it would be best to time the application separately for:
> >
> >   * the matching phase (identifying candidates from the corpus of documents)
> >   * the ranking phase (scoring them by relevance)
> >
> > Maybe this will help speed things up further.
> >
> > Also, what do you mean by "what if the user needs to limit the search
> > process"? Can you elaborate?
> >
> > Cheers
> >
> >
> >
> > My answer would be :
> >
> > I can't access the Lucene code, so how can I time these two cases, please?
> >
> > By that sentence I mean that once I see the hits are good, I would like
> > to limit the number of hits.
> >
> >
> >
> > This is more like a professional conversation, please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1].
> > > Can you describe your original user requirement?
> > > Most of the time, at the cost of indexing time/space, you may get quicker
> > > query times.
> > > Also, you should identify where you are wasting most of your time: in the
> > > matching phase (identifying candidates from the corpus of documents) or in
> > > the ranking phase (scoring them by relevance).
> > >
> > > TopScoreDocCollector is quite a solid class; there's a ton to study,
> > > analyze and experiment with before raising the alarm about a bug :)
> > >
> > > Also, I didn't understand this:
> > > "what if the user needs to limit the search process?"
> > > Can you elaborate?
> > >
> > > Cheers
> > >
> > >
> > >
> > > [1] https://xyproblem.info
> > > --------------------------
> > > Alessandro Benedetti
> > > Apache Lucene/Solr Committer
> > > Director, R&D Software Engineer, Search Consultant
> > >
> > >
> > > http://www.sease.io
> > >
> > >
> > > On Wed, 9 Jun 2021 at 19:08, <[email protected]> wrote:
> > >
> > >> Yes, I did those and I believe I am at the best level of performance now,
> > >> and it is not bad at all, but I want to make it much better.
> > >>
> > >> I see something like a linear drop in timings when I go to a lower number
> > >> of words, but let me do that quick study again.
> > >>
> > >> Fuzzy search is always expensive, but it seems to suit my needs best.
> > >>
> > >>
> > >> Thanks, Diego, for these great questions; I have already explored them.
> > >> But thanks again.
> > >>
> > >> Best regards
> > >>
> > >>
> > >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>> I have never used fuzzy search, but from the documentation it seems very
> > >>> expensive, and if you do it on 10 terms and 1M documents it seems very,
> > >>> very, very expensive.
> > >>> Are you using the default 'fuzzyness' parameter (0.5)? It might end up
> > >>> exploring a lot of documents. Did you try to play with that parameter?
> > >>> Have you tried to see how the performance changes if you do not use fuzzy
> > >>> (just to see whether fuzzy is what introduces the slowdown)?
> > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> > >>> instead of 10?
> > >>>
> > >>> From: [email protected] At: 06/09/21 18:56:31 To: [email protected], [email protected]
> > >>> Subject: Re: Potential bug
> > >>>
> > >>> I can't reveal those details, I am very sorry, but it is more than 1
> > >>> million.
> > >>> Let me say that I have a lot of code that processes results from Lucene,
> > >>> but the bottleneck is Lucene fuzzy search.
> > >>>
> > >>> Best regards
> > >>>
> > >>>
> > >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>> How many documents do you have in the index?
> > >>>> And can you show an example query?
> > >>>>
> > >>>>
> > >>>> From: [email protected] At: 06/09/21 18:33:25 To: [email protected], [email protected]
> > >>>> Subject: Re: Potential bug
> > >>>>
> > >>>> I have only two fields: one is a string, the other is a number (stored as
> > >>>> a string). I guess you can't go simpler than this.
> > >>>>
> > >>>> I retrieve the hits, and my major bottleneck is Lucene fuzzy search.
> > >>>>
> > >>>>
> > >>>> I take each word from the string, which is usually at most around 10
> > >>>> words, and I build a fuzzy boolean query out of them.
> > >>>>
> > >>>>
> > >>>> A simple query is like this 10-word query.
> > >>>>
> > >>>>
> > >>>> Limit means I want to stop the Lucene search at around 20 hits; I don't
> > >>>> want thousands of hits.
> > >>>>
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>>
> > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>>
> > >>>>> Hi Baris,
> > >>>>>
> > >>>>>> what if the user needs to limit the search process?
> > >>>>> What do you mean by 'limit'?
> > >>>>>
> > >>>>>> there should be a way to speedup lucene then if this is not possible,
> > >>>>>> since for some simple queries it takes half a second which is too long.
> > >>>>> What do you mean by a 'simple' query? There might be multiple reasons behind
> > >>>>> the slowness of a query that are unrelated to the search (for example, if you
> > >>>>> retrieve many documents and for each document you are extracting the content
> > >>>>> of many fields). Would you like to tell us a bit more about your use case?
> > >>>>> Regards,
> > >>>>> Diego
> > >>>>>
> > >>>>> From: [email protected] At: 06/09/21 18:18:01 To: [email protected]
> > >>>>> Cc:  [email protected]
> > >>>>> Subject: Re: Potential bug
> > >>>>>
> > >>>>> Thanks Adrien, but the difference is too large.
> > >>>>>
> > >>>>> I think the algorithm needs to be revised.
> > >>>>>
> > >>>>>
> > >>>>> what if the user needs to limit the search process?
> > >>>>>
> > >>>>> that leaves no control.
> > >>>>>
> > >>>>> there should be a way to speedup lucene then if this is not possible,
> > >>>>>
> > >>>>> since for some simple queries it takes half a second which is too long.
> > >>>>>
> > >>>>> Best regards
> > >>>>>
> > >>>>>
> > >>>>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> > >>>>>> Hi Baris,
> > >>>>>>
> > >>>>>> totalHitsThreshold is actually a minimum threshold, not a maximum threshold.
> > >>>>>> The problem is that Lucene cannot directly identify the top matching
> > >>>>>> documents for a given query. The strategy it adopts is to start collecting
> > >>>>>> hits naively in doc ID order and to progressively raise the bar on the
> > >>>>>> minimum score that is required for a hit to be competitive, in order to skip
> > >>>>>> non-competitive documents. So it's expected that Lucene still collects 100s
> > >>>>>> or 1000s of hits, even though the collector is configured to only compute
> > >>>>>> the top 10 hits.
> > >>>>>>
> > >>>>>> On Wed, Jun 9, 2021 at 7:07 PM <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hi,-
> > >>>>>>>
> > >>>>>>> I think this is a potential bug.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> This time I set totalHitsThreshold to 10 and I get totalHits reported as
> > >>>>>>> 1655, but I get 10 results in total.
> > >>>>>>>
> > >>>>>>> I think this suggests that there might be a bug in the
> > >>>>>>> TopScoreDocCollector algorithm.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Best regards
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ---------------------------------------------------------------------
> > >>>>>>> To unsubscribe, e-mail: [email protected]
> > >>>>>>> For additional commands, e-mail: [email protected]
> > >>>>>>>
> > >>>>>>>
> >
>


-- 
Adrien
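
For readers following the technical part of the thread, the behaviour Adrien
describes (count every hit until the threshold is passed, then raise the
minimum competitive score and skip what cannot make the top hits) can be
illustrated with a small toy model. This is only a sketch in plain Java and
not Lucene's code: the class ThresholdDemo, its collect method and the Result
type are made-up names, the score array stands in for the matching documents
in doc ID order, and the model skips every non-competitive document, whereas a
real search may still visit and count many of them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Toy model of top-k collection with a total-hits threshold. Not Lucene code. */
final class ThresholdDemo {

    /** Outcome of one simulated search (hypothetical type for this sketch). */
    static final class Result {
        final List<Float> topScores; // best scores, highest first
        final int countedHits;       // hits the collector actually saw
        final boolean countIsExact;  // false once the bar has been raised

        Result(List<Float> topScores, int countedHits, boolean countIsExact) {
            this.topScores = topScores;
            this.countedHits = countedHits;
            this.countIsExact = countIsExact;
        }
    }

    static Result collect(float[] scoresByDocId, int numHits, int totalHitsThreshold) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        PriorityQueue<Float> queue = new PriorityQueue<>(); // min-heap of the current top hits
        float minCompetitive = Float.NEGATIVE_INFINITY;
        int counted = 0;
        boolean exact = true;

        for (float score : scoresByDocId) { // visit matches in doc ID order
            // Stand-in for the scorer skipping documents that cannot enter the top hits.
            if (score < minCompetitive) {
                continue;
            }
            counted++;
            if (queue.size() < numHits) {
                queue.add(score);
            } else if (score > queue.peek()) {
                queue.poll();
                queue.add(score);
            }
            // The bar may only be raised after MORE than totalHitsThreshold hits
            // were counted, so the threshold is a minimum for the count, not a cap.
            if (counted > totalHitsThreshold && queue.size() == numHits) {
                float worstTopScore = queue.peek();
                minCompetitive = Math.nextUp(worstTopScore);
                exact = false; // from here on the count is only a lower bound
            }
        }
        List<Float> top = new ArrayList<>(queue);
        top.sort(Collections.reverseOrder());
        return new Result(top, counted, exact);
    }

    public static void main(String[] args) {
        Random random = new Random(7); // fixed seed so runs are repeatable
        float[] scores = new float[10_000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = random.nextFloat();
        }
        Result r = collect(scores, 10, 10);
        System.out.println("top hits returned: " + r.topScores.size());
        System.out.println("hits counted:      " + r.countedHits);
        System.out.println("count is exact:    " + r.countIsExact);
    }
}
```

With numHits = 10 and totalHitsThreshold = 10, the counted hits in this model
typically end up above 10 but below the number of matching documents, and the
count is flagged as a lower bound. That is the same shape as the result
reported at the start of the thread (10 results, 1655 counted hits).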
