BM25 came out of work on probabilistic engines, but using BM25 in Solr doesn’t automatically make it probabilistic.
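BM25Similarity is just one pluggable Similarity implementation sitting inside the same Lucene scoring framework. From memory (untested, and the k1/b values shown are simply the defaults), switching it in schema.xml looks roughly like this:

    <!-- global similarity for the schema; an individual fieldType can override it -->
    <similarity class="solr.BM25SimilarityFactory">
      <float name="k1">1.2</float>
      <float name="b">0.75</float>
    </similarity>

    <!-- or keep the older TF-IDF scoring, as Koji does -->
    <!-- <similarity class="solr.ClassicSimilarityFactory"/> -->

Swapping that element changes the scoring formula, not the retrieval model around it.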
I once read a paper, possibly by Karen Spärck Jones, showing that the two models are not that different. Still, even with a probabilistic model, relevance cutoffs don’t work. It is still too easy for a good match to get a low score. We’re back to increasing the good hits vs. reducing the bad hits; you really only get to achieve one of those two.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Apr 12, 2017, at 7:41 PM, Koji Sekiguchi <koji.sekigu...@rondhuit.com> wrote:
>
> Hi Walter,
>
> May I ask a tangential question? I’m curious about the following line you wrote:
>
> > Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results is just not as good as vector-space engines. So, probabilistic engines are mostly dead.
>
> Can you elaborate on this?
>
> I thought Okapi BM25, which is the default Similarity in Solr, is based on the probabilistic model. Did you mean that Lucene/Solr is still based on the vector space model, but BM25Similarity was built on top of it and is therefore not a pure probabilistic scoring system? Or that Okapi BM25 is not originally probabilistic?
>
> As for me, I prefer the vector space idea over the probabilistic one for information retrieval, and I stick with ClassicSimilarity for my projects.
>
> Thanks,
>
> Koji
>
> On 2017/04/13 4:08, Walter Underwood wrote:
>> Fine. It can’t be done. If it were easy, Solr/Lucene would already have the feature, right?
>>
>> Solr is a vector-space engine. Some early engines (Verity VDK) were probabilistic engines. Those do give an absolute estimate of the relevance of each hit. Unfortunately, the relevance of results is just not as good as vector-space engines. So, probabilistic engines are mostly dead.
>>
>> But “you don’t want to do it” is very good advice. Instead of trying to reduce bad hits, work on increasing good hits. It is really hard, sometimes not possible, to optimize both. Increasing the good hits makes your customers happy. Reducing the bad hits makes your UX team happy.
>>
>> Here is a process. Start collecting the clicks on the search results page (SRP) for each query. Look at queries that have below-average clickthrough. See if those can be combined into categories, then address each category. Some categories that I have used:
>>
>> * One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all valid. Use synonyms or shingles (and maybe the word delimiter filter) to match these.
>> * Misspellings. These should be about 10% of queries. Use fuzzy matching. I recommend the patch in SOLR-629.
>> * Alternate vocabulary. You sell a “laptop”, but people call it a “notebook”. People search for “kids movies”, but your movie genre is “Children and Family”. Use synonyms.
>> * Missing content. People can’t find anything about beach parking because there isn’t a page about that. Instead, there are scraps of info about beach parking scattered across multiple other pages. Fix the content.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Apr 12, 2017, at 11:44 AM, David Kramer <david.kra...@shoebuy.com> wrote:
>>>
>>> The idea is to not return poorly matching results, not to limit the number of results returned. One query may have hundreds of excellent matches and another query may have 7.
>>> So cutting off by the number of results is trivial but not useful.
>>>
>>> Again, we are not doing this for performance reasons. We’re doing this because, for UX reasons, we don’t want to show products that are not very relevant to the search terms the user specified.
>>>
>>> I had hoped the responses would have been more focused on “it can’t be done” or “here’s how to do it” than on “you don’t want to do it”. I’m still left not knowing whether it’s even possible. The one concrete answer, using frange, doesn’t help, as referencing score in either the q or the fq produces an “undefined field” error.
>>>
>>> Thanks.
>>>
>>> On 4/11/17, 8:59 AM, "Dorian Hoxha" <dorian.ho...@gmail.com> wrote:
>>>
>>> Can’t the filter be used when you’re paginating in a sharded scenario?
>>> If you do limit=10, offset=10, each shard has to return 20 docs.
>>> But if you do limit=10, _score <= last_page.min_score, then each shard only needs to return 10 docs. (They will still score all docs, but merging will be faster.)
>>>
>>> Makes sense?
>>>
>>> On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti <a.benede...@sease.io> wrote:
>>>
>>>> Can I ask what the final requirement is here? What are you trying to do?
>>>> - Just display fewer results? You can easily do that at search-client time, cutting off after a certain amount.
>>>> - Make search faster by returning fewer results? That is not going to work, as you need to score all of them, as Erick explained.
>>>>
>>>> A function query (as Mikhail specified) runs on a per-document basis (if I am correct), so if the idea was to speed things up, that is not going to work either.
>>>>
>>>> It makes much more sense to refine your system to improve relevancy if your concern is having more relevant docs. If your concern is just not showing that many pages, you can limit that client side.
>>>>
>>>> -----
>>>> ---------------
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd. - www.sease.io
>>>> --
>>>> View this message in context: http://lucene.472066.n3.nabble.com/Filtering-results-by-minimum-relevancy-score-tp4329180p4329295.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
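P.S. On the frange error David mentioned: score is not a real field, so it cannot be referenced directly in q or fq. The workaround that usually gets posted here is to wrap the main query in the query() function inside an frange filter, something like (untested sketch; the 0.5 threshold is made up):

    q=laptop bag
    fq={!frange l=0.5 cache=false}query($q)

That filters on the raw Lucene score. But raw scores are not comparable from one query to the next, so a fixed cutoff like that is exactly the kind of thing that breaks, which is the point I was making above.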