Baris, you called out an insult from Alessandro and your replies suggest anger, but I couldn't see an insult from Alessandro actually.
+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <a.benede...@sease.io> wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, <baris.ka...@oracle.com> wrote:
>
> > Let me guide to a professional answer to the below email:
> >
> >
> > Hi Baris,
> >
> > Since You mentioned You did all the performance study on your
> > application and still believe that
> >
> > the bottleneck is the fuzzy search api from Lucene, it would be best to
> > time the application for:
> >
> > * matching phase (identifying candidates from the corpus of documents)
> > * or in the ranking phase (scoring them by relevance)?
> >
> > Maybe this will help speedup further.
> >
> > Also, what do You mean by "what is the user needs to to limit te search
> > process" ? can you elaborate?
> >
> > Cheers
> >
> >
> > My answer would be :
> >
> > i cant access the Lucene code so how can time these two cases please?
> >
> > i mean by that sentence that when i see the hits are good i would like
> > to limit the number of hits.
> >
> >
> > this is more like a professional conversation please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1 <https://urldefense.com/v3/__https://xyproblem.info__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq2Yo0eBzg$>].
> > > Can you express what is your original user requirement?
> > > Most of the time, at the cost of indexing time/space you may get quicker
> > > query times.
> > > Also, you should identify where are you wasting most of your time, in the
> > > matching phase (identifying candidates from the corpus of documents) or in
> > > the ranking phase (scoring them by relevance)?
> > >
> > > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > > analyze and experiment before raising the alarm of a bug :)
> > >
> > > Also didn't understand this :
> > > "what if the user needs to limit the search process?"
> > > Can you elaborate?
> > >
> > > Cheers
> > >
> > >
> > > [1]
> > > https://urldefense.com/v3/__https://xyproblem.info__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq2Yo0eBzg$
> > > --------------------------
> > > Alessandro Benedetti
> > > Apache Lucene/Solr Committer
> > > Director, R&D Software Engineer, Search Consultant
> > >
> > > https://urldefense.com/v3/__http://www.sease.io__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq07hrsXPw$
> > >
> > >
> > > On Wed, 9 Jun 2021 at 19:08, <baris.ka...@oracle.com> wrote:
> > >
> > >> Yes, i did those and i believe i am at the best level of performance now
> > >> and it is not bad at all but i want to make it much better.
> > >>
> > >> i see like a linear drop in timings when i go lower number of words but
> > >> let me do that quick study again.
> > >>
> > >> Fuzzy search is always expensive but that seems to suit best to my needs.
> > >>
> > >>
> > >> Thanks Diego for these great questions and i already explored them.
> > >> But thanks again.
> > >>
> > >> Best regards
> > >>
> > >>
> > >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>> I have never used fuzzy search but from the documentation it seems very
> > >>> expensive, and if you do it on 10 terms and 1M documents it seems very very
> > >>> very expensive.
> > >>> Are you using the default 'fuzzyness' parameter? (0.5) - It might end up
> > >>> exploring a lot of documents, did you try to play with that parameter?
> > >>> Have you tried to see how the performance change if you do not use fuzzy
> > >>> (just to see if is fuzzy the introduce the slow down)?
> > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> > >>> instead of 10?
> > >>>
> > >>> From: java-user@lucene.apache.org At: 06/09/21 18:56:31 To:
> > >>> java-user@lucene.apache.org, baris.ka...@oracle.com
> > >>> Subject: Re: Potential bug
> > >>>
> > >>> i cant reveal those details i am very sorry. but it is more than 1
> > >>> million.
> > >>> let me tell that i have a lot of code that processes results from lucene
> > >>> but the bottle neck is lucene fuzzy search.
> > >>>
> > >>> Best regards
> > >>>
> > >>>
> > >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>> How many documents do you have in the index?
> > >>>> and can you show an example of query?
> > >>>>
> > >>>>
> > >>>> From: java-user@lucene.apache.org At: 06/09/21 18:33:25 To:
> > >>>> java-user@lucene.apache.org, baris.ka...@oracle.com
> > >>>> Subject: Re: Potential bug
> > >>>>
> > >>>> i have only two fields one string the other is a number (stored as
> > >>>> string), i guess you cant go simpler than this.
> > >>>>
> > >>>> i retreieve the hits and my major bottleneck is lucene fuzzy search.
> > >>>>
> > >>>>
> > >>>> i take each word from the string which is usually around at most 10 words
> > >>>> i build a fuzzy boolean query out of them.
> > >>>>
> > >>>>
> > >>>> simple query is like this 10 word query.
> > >>>>
> > >>>>
> > >>>> limit means i want to stop lucene search around 20 hits i dont want
> > >>>> thousands of hits.
> > >>>>
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>>
> > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>>
> > >>>>> Hi Baris,
> > >>>>>
> > >>>>>> what if the user needs to limit the search process?
> > >>>>> What do you mean by 'limit'?
> > >>>>>
> > >>>>>> there should be a way to speedup lucene then if this is not possible,
> > >>>>>> since for some simple queries it takes half a second which is too long.
> > >>>>> What do you mean by 'simple' query? there might be multiple reasons behind
> > >>>>> slowness of a query that are unrelated to the search (for example, if you
> > >>>>> retrieve many documents and for each document you are extracting the content of
> > >>>>> many fields) - would you like to tell us a bit more about your use case?
> > >>>>> Regards,
> > >>>>> Diego
> > >>>>>
> > >>>>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01 To:
> > >>>>> java-user@lucene.apache.org
> > >>>>> Cc: baris.ka...@oracle.com
> > >>>>> Subject: Re: Potential bug
> > >>>>>
> > >>>>> Thanks Adrien, but the differences is too far apart.
> > >>>>>
> > >>>>> I think the algorithm needs to be revised.
> > >>>>>
> > >>>>>
> > >>>>> what if the user needs to limit the search process?
> > >>>>>
> > >>>>> that leaves no control.
> > >>>>>
> > >>>>> there should be a way to speedup lucene then if this is not possible,
> > >>>>>
> > >>>>> since for some simple queries it takes half a second which is too long.
> > >>>>>
> > >>>>> Best regards
> > >>>>>
> > >>>>>
> > >>>>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> > >>>>>> Hi Baris,
> > >>>>>>
> > >>>>>> totalhitsThreshold is actually a minimum threshold, not a maximum threshold.
> > >>>>>> The problem is that Lucene cannot directly identify the top matching
> > >>>>>> documents for a given query.
> > >>>>>> The strategy it adopts is to start collecting
> > >>>>>> hits naively in doc ID order and to progressively raise the bar about the
> > >>>>>> minimum score that is required for a hit to be competitive in order to skip
> > >>>>>> non-competitive documents. So it's expected that Lucene still collects 100s
> > >>>>>> or 1000s of hits, even though the collector is configured to only compute
> > >>>>>> the top 10 hits.
> > >>>>>>
> > >>>>>> On Wed, Jun 9, 2021 at 7:07 PM <baris.ka...@oracle.com> wrote:
> > >>>>>>
> > >>>>>>> Hi,-
> > >>>>>>>
> > >>>>>>> i think this is a potential bug
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> i set this time totalHitsThreshold to 10 and i get totalhits reported as
> > >>>>>>> 1655 but i get 10 results in total.
> > >>>>>>>
> > >>>>>>> I think this suggests that there might be a bug with
> > >>>>>>> TopScoreDocCollector algorithm.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Best regards
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> ---------------------------------------------------------------------
> > >>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > >>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Adrien
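
For anyone reading the quoted explanation of totalHitsThreshold, here is a minimal sketch of why the reported hit count can be much larger than the number of results returned. It is a toy model in plain Java, not Lucene's implementation: the class TopKSketch, its collect method and the Result type are hypothetical names made up for illustration, and the scores are just random floats visited in doc ID order. It assumes only what the explanation states: hits are collected in doc ID order, every hit is counted until the threshold is passed, and after that only hits that can still enter the top N are visited.

```java
import java.util.PriorityQueue;
import java.util.Random;

// Toy model only (hypothetical names); this is NOT Lucene's TopScoreDocCollector.
class TopKSketch {

    /** Outcome of one toy collection run. */
    static final class Result {
        final int countedHits; // hits that were actually visited and counted
        final int keptHits;    // hits retained in the top-N queue

        Result(int countedHits, int keptHits) {
            this.countedHits = countedHits;
            this.keptHits = keptHits;
        }
    }

    /**
     * Visits scores in doc ID order (array index = doc ID), keeps the best
     * numHits scores, and starts skipping non-competitive hits only after
     * more than totalHitsThreshold hits have been counted.
     */
    static Result collect(float[] scoresInDocIdOrder, int numHits, int totalHitsThreshold) {
        if (numHits <= 0) {
            throw new IllegalArgumentException("numHits must be positive");
        }
        // Min-heap: the head is the weakest score currently kept.
        PriorityQueue<Float> topN = new PriorityQueue<>();
        int counted = 0;
        for (float score : scoresInDocIdOrder) {
            boolean queueFull = topN.size() == numHits;
            boolean maySkip = queueFull && counted > totalHitsThreshold;
            if (maySkip && score <= topN.peek()) {
                continue; // non-competitive: skipped and never counted
            }
            counted++;
            if (!queueFull) {
                topN.add(score);
            } else if (score > topN.peek()) {
                topN.poll();      // raise the bar: drop the weakest kept hit
                topN.add(score);
            }
        }
        return new Result(counted, topN.size());
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        float[] scores = new float[100_000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = random.nextFloat();
        }
        Result r = collect(scores, 10, 10);
        System.out.println("counted hits: " + r.countedHits
                + ", returned hits: " + r.keptHits);
    }
}
```

In this model the threshold is a floor on how many hits get counted, not a cap on how many are visited, so a counted total well above the 10 returned hits is the expected outcome rather than a sign of a bug.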