Re: Potential bug

Atri Sharma Mon, 14 Jun 2021 05:45:51 -0700

+1 to Adrien.

Let's keep the tone neutral.


On Mon, 14 Jun 2021, 16:00 Adrien Grand, <jpou...@gmail.com> wrote:

> Baris, you called out an insult from Alessandro and your replies suggest
> anger, but I couldn't see an insult from Alessandro actually.
>
> +1 to Alessandro's call to make the tone softer on this discussion.
>
> On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <a.benede...@sease.io> wrote:
>
> > Hi Baris,
> > first of all apologies for having misspelled your name, definitely, it
> > was not meant as an insult.
> > Secondly, your tone is not acceptable on this mailing list (or anywhere
> > else).
> > You must remember that we, committers, are operating on a volunteering
> > basis, contributing code and helping people in our free time purely
> > driven by passion.
> > Respect is fundamental, we are not here to be treated aggressively.
> >
> > Regards
> >
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> > www.sease.io
> >
> >
> > On Fri, 11 Jun 2021 at 17:10, <baris.ka...@oracle.com> wrote:
> >
> > > Let me guide to a professional answer to the below email:
> > >
> > >
> > > Hi Baris,
> > >
> > > Since You mentioned You did all the performance study on your
> > > application and still believe that
> > >
> > > the bottleneck is the fuzzy search api from Lucene, it would be best to
> > > time the application for:
> > >
> > >   * matching phase (identifying candidates from the corpus of documents)
> > >   * or in the ranking phase (scoring them by relevance)?
> > >
> > > Maybe this will help speedup further.
> > >
> > > Also, what do You mean by "what if the user needs to limit the search
> > > process"? Can you elaborate?
> > >
> > > Cheers
> > >
> > >
> > >
> > > My answer would be :
> > >
> > > i can't access the Lucene code so how can i time these two cases please?
> > >
> > > i mean by that sentence that when i see the hits are good i would like
> > > to limit the number of hits.
> > >
> > >
> > >
> > > this is more like a professional conversation please. Thanks.
> > >
> > > Best regards
> > >
> > >
> > > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > > Hi Bazir,
> > > > this feels like an X Y problem [1].
> > > > Can you express what is your original user requirement?
> > > > Most of the time, at the cost of indexing time/space you may get
> > > > quicker query times.
> > > > Also, you should identify where are you wasting most of your time, in
> > > > the matching phase (identifying candidates from the corpus of documents)
> > > > or in the ranking phase (scoring them by relevance)?
> > > >
> > > > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > > > analyze and experiment before raising the alarm of a bug :)
> > > >
> > > > Also didn't understand this :
> > > > "what if the user needs to limit the search process?"
> > > > Can you elaborate?
> > > >
> > > > Cheers
> > > >
> > > >
> > > >
> > > > [1] https://urldefense.com/v3/__https://xyproblem.info__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq2Yo0eBzg$
> > > > --------------------------
> > > > Alessandro Benedetti
> > > > Apache Lucene/Solr Committer
> > > > Director, R&D Software Engineer, Search Consultant
> > > >
> > > > https://urldefense.com/v3/__http://www.sease.io__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq07hrsXPw$
> > > >
> > > >
> > > > On Wed, 9 Jun 2021 at 19:08, <baris.ka...@oracle.com> wrote:
> > > >
> > > > >> Yes, i did those and i believe i am at the best level of performance
> > > > >> now and it is not bad at all but i want to make it much better.
> > > > >>
> > > > >> i see like a linear drop in timings when i go lower number of words
> > > > >> but let me do that quick study again.
> > > > >>
> > > > >> Fuzzy search  is always expensive but that seems to suit best to my
> > > > >> needs.
> > > > >>
> > > > >>
> > > > >> Thanks Diego for these great questions and i already explored them.
> > > > >> But thanks again.
> > > >>
> > > >> Best regards
> > > >>
> > > >>
> > > >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > > > >>> I have never used fuzzy search but from the documentation it seems
> > > > >>> very expensive, and if you do it on 10 terms and 1M documents it
> > > > >>> seems very very very expensive.
> > > > >>> Are you using the default 'fuzzyness' parameter? (0.5) - It might
> > > > >>> end up exploring a lot of documents, did you try to play with that
> > > > >>> parameter?
> > > > >>> Have you tried to see how the performance changes if you do not use
> > > > >>> fuzzy (just to see if it is fuzzy that introduces the slowdown)?
> > > > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> > > > >>> instead of 10?
> > > >>>
> > > > >>> From: java-user@lucene.apache.org At: 06/09/21 18:56:31 To:
> > > > >>> java-user@lucene.apache.org, baris.ka...@oracle.com
> > > > >>> Subject: Re: Potential bug
> > > > >>>
> > > > >>> i cant reveal those details i am very sorry. but it is more than 1
> > > > >>> million.
> > > > >>> let me tell that i have a lot of code that processes results from
> > > > >>> lucene but the bottleneck is lucene fuzzy search.
> > > >>>
> > > >>> Best regards
> > > >>>
> > > >>>
> > > >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > > >>>> How many documents do you have in the index?
> > > >>>> and can you show an example of query?
> > > >>>>
> > > >>>>
> > > > >>>> From: java-user@lucene.apache.org At: 06/09/21 18:33:25 To:
> > > > >>>> java-user@lucene.apache.org, baris.ka...@oracle.com
> > > >>>> Subject: Re: Potential bug
> > > >>>>
> > > >>>> i have only two fields one string the other is a number (stored as
> > > >>>> string), i guess you cant go simpler than this.
> > > >>>>
> > > > >>>> i retrieve the hits and my major bottleneck is lucene fuzzy search.
> > > > >>>>
> > > > >>>>
> > > > >>>> i take each word from the string which is usually around at most 10
> > > > >>>> words
> > > > >>>> i build a fuzzy boolean query out of them.
> > > >>>>
> > > >>>>
> > > >>>> simple query is like this 10 word query.
> > > >>>>
> > > >>>>
> > > > >>>> limit means i want to stop lucene search around 20 hits i dont want
> > > > >>>> thousands of hits.
> > > >>>>
> > > >>>>
> > > >>>> Best regards
> > > >>>>
> > > >>>>
> > > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > > >>>>
> > > >>>>> Hi Baris,
> > > >>>>>
> > > >>>>>> what if the user needs to limit the search process?
> > > >>>>> What do you mean by 'limit'?
> > > >>>>>
> > > > >>>>>> there should be a way to speedup lucene then if this is not
> > > > >>>>>> possible, since for some simple queries it takes half a second
> > > > >>>>>> which is too long.
> > > > >>>>> What do you mean by 'simple' query? there might be multiple reasons
> > > > >>>>> behind slowness of a query that are unrelated to the search (for
> > > > >>>>> example, if you retrieve many documents and for each document you
> > > > >>>>> are extracting the content of many fields) - would you like to
> > > > >>>>> tell us a bit more about your use case?
> > > >>>>> Regards,
> > > >>>>> Diego
> > > >>>>>
> > > > >>>>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01 To:
> > > > >>>>> java-user@lucene.apache.org
> > > >>>>> Cc:  baris.ka...@oracle.com
> > > >>>>> Subject: Re: Potential bug
> > > >>>>>
> > > > >>>>> Thanks Adrien, but the difference is too large.
> > > >>>>>
> > > >>>>> I think the algorithm needs to be revised.
> > > >>>>>
> > > >>>>>
> > > >>>>> what if the user needs to limit the search process?
> > > >>>>>
> > > >>>>> that leaves no control.
> > > >>>>>
> > > > >>>>> there should be a way to speedup lucene then if this is not possible,
> > > > >>>>>
> > > > >>>>> since for some simple queries it takes half a second which is too
> > > > >>>>> long.
> > > >>>>>
> > > >>>>> Best regards
> > > >>>>>
> > > >>>>>
> > > >>>>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> > > >>>>>> Hi Baris,
> > > >>>>>>
> > > > >>>>>> totalHitsThreshold is actually a minimum threshold, not a maximum
> > > > >>>>>> threshold.
> > > > >>>>>> The problem is that Lucene cannot directly identify the top matching
> > > > >>>>>> documents for a given query. The strategy it adopts is to start
> > > > >>>>>> collecting hits naively in doc ID order and to progressively raise
> > > > >>>>>> the bar about the minimum score that is required for a hit to be
> > > > >>>>>> competitive in order to skip non-competitive documents. So it's
> > > > >>>>>> expected that Lucene still collects 100s or 1000s of hits, even
> > > > >>>>>> though the collector is configured to only compute the top 10 hits.
> > > >>>>>>
> > > >>>>>> On Wed, Jun 9, 2021 at 7:07 PM <baris.ka...@oracle.com> wrote:
> > > >>>>>>
> > > >>>>>>> Hi,-
> > > >>>>>>>
> > > >>>>>>>        i think this is a potential bug
> > > >>>>>>>
> > > >>>>>>>
> > > > >>>>>>> i set this time totalHitsThreshold to 10 and i get totalhits
> > > > >>>>>>> reported as 1655 but i get 10 results in total.
> > > >>>>>>>
> > > >>>>>>> I think this suggests that there might be a bug with
> > > >>>>>>> TopScoreDocCollector algorithm.
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>> Best regards
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>>
> > > ---------------------------------------------------------------------
> > > >>>>>>> To unsubscribe, e-mail:
> java-user-unsubscr...@lucene.apache.org
> > > >>>>>>> For additional commands, e-mail:
> > java-user-h...@lucene.apache.org
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>
> > ---------------------------------------------------------------------
> > > >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >>>>> For additional commands, e-mail:
> java-user-h...@lucene.apache.org
> > > >>>>>
> > > >>>>>
> > > >>>>
> > ---------------------------------------------------------------------
> > > >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >>>>
> > > >>>>
> > > >>>
> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >>>
> > > >>>
> > > >>
> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >>
> > > >>
> > >
> >
>
>
> --
> Adrien
>

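For readers following the technical point in Adrien's reply quoted above, here
is a minimal sketch of why a collector asked for the top 10 hits can still
report a much larger hit count. It is a toy model in plain Java, not Lucene
code, and not how TopScoreDocCollector is actually implemented. The names
ThresholdSketch, collect and Result are hypothetical, and the scores array
stands in for matching documents visited in doc ID order. The only assumption
taken from the thread is the behaviour Adrien describes: hits are counted
exactly at least until totalHitsThreshold is reached, and only after that may
non-competitive documents be skipped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model only, not Lucene code: a top-k collector with a hit-count threshold. */
final class ThresholdSketch {

    /** Outcome of one simulated search. */
    static final class Result {
        final List<Double> topScores; // best scores, highest first
        final int countedHits;        // hits the collector actually visited
        final boolean countIsExact;   // false once any document was skipped

        Result(List<Double> topScores, int countedHits, boolean countIsExact) {
            this.topScores = topScores;
            this.countedHits = countedHits;
            this.countIsExact = countIsExact;
        }
    }

    /** scores[i] is the (hypothetical) score of the matching document with doc ID i. */
    static Result collect(double[] scores, int numHits, int totalHitsThreshold) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap: worst kept score on top
        int counted = 0;
        boolean exact = true;
        for (double score : scores) { // visit hits in doc ID order
            boolean full = heap.size() == numHits;
            // Skipping is allowed only after the threshold has been counted,
            // and only for hits that cannot enter the current top-k.
            if (counted >= totalHitsThreshold && full && score <= heap.peek()) {
                exact = false;
                continue;
            }
            counted++;
            if (!full) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();      // drop the weakest kept hit
                heap.add(score);  // raise the bar for later hits
            }
        }
        List<Double> top = new ArrayList<>(heap);
        top.sort(Collections.reverseOrder());
        return new Result(top, counted, exact);
    }

    public static void main(String[] args) {
        double[] ascending = new double[1000];
        for (int i = 0; i < ascending.length; i++) {
            ascending[i] = i + 1; // every new hit beats the current top 10
        }
        Result r = collect(ascending, 10, 10);
        System.out.println("top hits: " + r.topScores.size()
                + ", counted hits: " + r.countedHits
                + ", exact: " + r.countIsExact);
    }
}
```

In this toy run every later document is competitive, so nothing can be skipped
and the sketch prints "top hits: 10, counted hits: 1000, exact: true". That is
the same shape as the 10 results and 1655 reported hits from the start of the
thread: the threshold is a floor on how many hits are counted, not a cap on how
many are visited. In Lucene 8 and later, as far as I know, TopDocs.totalHits
also carries a relation field that says whether the reported value is exact or
only a lower bound.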