BM25 came out of work on probabilistic engines, but using BM25 in Solr doesn’t 
automatically make it probabilistic.

I read a paper once that showed the two models are not that different, maybe by 
Karen Spärck Jones.

Still, even with a probabilistic model, relevance cutoffs don’t work. It is 
still too easy for a good match to have a low score. We’re back to increasing 
the good hits vs. reducing the bad hits. You really only achieve one of those.
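
To make the cutoff problem concrete, here is a rough sketch using the textbook BM25 term formula with the commonly used defaults k1=1.2 and b=0.75. The corpus statistics, the example terms, and the CUTOFF value are all made up for illustration, and this is not a copy of what Solr computes internally.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)

# Hypothetical corpus statistics, invented for this example.
NUM_DOCS = 100_000
AVG_LEN = 50

# A strong hit for a one-word query on a common term: the term appears
# three times in a short document, but it has a low IDF.
common_term_hit = bm25_term_score(tf=3, doc_len=40, avg_doc_len=AVG_LEN,
                                  doc_freq=20_000, num_docs=NUM_DOCS)

# A marginal hit on a rare term: one passing mention in a long document,
# but the term has a very high IDF.
rare_partial_hit = bm25_term_score(tf=1, doc_len=200, avg_doc_len=AVG_LEN,
                                   doc_freq=15, num_docs=NUM_DOCS)

# Any fixed cutoff between the two keeps the marginal hit and drops the good one.
CUTOFF = 3.0  # hypothetical threshold

print(f"good hit, common term:   {common_term_hit:.2f}")
print(f"marginal hit, rare term: {rare_partial_hit:.2f}")
```

The score scale depends on how rare the query terms are, so a threshold that suits one query is wrong for the next.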

Walter Underwood

> On Apr 12, 2017, at 7:41 PM, Koji Sekiguchi wrote:
> Hi Walter,
> May I ask a tangential question? I'm curious about the following lines you wrote:
> > Solr is a vector-space engine. Some early engines (Verity VDK) were 
> > probabilistic engines. Those do give an absolute estimate of the relevance 
> > of each hit. Unfortunately, the relevance of results is just not as good as 
> > vector-space engines. So, probabilistic engines are mostly dead.
> Can you elaborate on this?
> I thought Okapi BM25, which is the default Similarity in Solr, is based on 
> the probabilistic model. Did you mean that Lucene/Solr is still based on the 
> vector space model but BM25Similarity was built on top of it, so 
> BM25Similarity is not a pure probabilistic scoring system? Or that Okapi BM25 
> was not originally probabilistic?
> As for me, I prefer the vector space model to the probabilistic one for 
> information retrieval, and I stick with ClassicSimilarity for my projects.
> Thanks,
> Koji
> On 2017/04/13 4:08, Walter Underwood wrote:
>> Fine. It can’t be done. If it was easy, Solr/Lucene would already have the 
>> feature, right?
>> Solr is a vector-space engine. Some early engines (Verity VDK) were 
>> probabilistic engines. Those do give an absolute estimate of the relevance 
>> of each hit. Unfortunately, the relevance of results is just not as good as 
>> vector-space engines. So, probabilistic engines are mostly dead.
>> But, “you don’t want to do it” is very good advice. Instead of trying to 
>> reduce bad hits, work on increasing good hits. It is really hard, sometimes 
>> not possible, to optimize both. Increasing the good hits makes your 
>> customers happy. Reducing the bad hits makes your UX team happy.
>> Here is a process. Start collecting the clicks on the search results page 
>> (SRP) with each query. Look at queries that have below average clickthrough. 
>> See if those can be combined into categories, then address each category.
>> Some categories that I have used:
>> * One word or two? “babysitter”, “baby-sitter”, and “baby sitter” are all 
>> valid. Use synonyms or shingles (and maybe the word delimiter filter) to 
>> match these.
>> * Misspellings. These should be about 10% of queries. Use fuzzy matching. I 
>> recommend the patch in SOLR-629.
>> * Alternate vocabulary. You sell a “laptop”, but people call it a 
>> “notebook”. People search for “kids movies”, but your movie genre is 
>> “Children and Family”. Use synonyms.
>> * Missing content. People can’t find anything about beach parking because 
>> there isn’t a page about that. Instead, there are scraps of info about beach 
>> parking in multiple other pages. Fix the content.
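
A minimal sketch of the clickthrough step described above, assuming the SRP log can be reduced to (query, clicked) pairs with one pair per results page view. The log format, the function name, and the sample data are hypothetical, and "below average" is taken here to mean below the overall clickthrough rate across all searches.

```python
from collections import defaultdict

def low_clickthrough_queries(events, min_searches=2):
    """Return (query, clickthrough, searches) rows for queries whose
    clickthrough rate is below the overall rate, worst first.

    events: iterable of (query, clicked) pairs, one per results page view.
    min_searches: ignore queries seen fewer times than this.
    """
    searches = defaultdict(int)
    clicks = defaultdict(int)
    for query, clicked in events:
        key = query.strip().lower()  # light normalization only
        searches[key] += 1
        if clicked:
            clicks[key] += 1
    if not searches:
        return []
    overall = sum(clicks.values()) / sum(searches.values())
    low = [(q, clicks[q] / n, n) for q, n in searches.items()
           if n >= min_searches and clicks[q] / n < overall]
    # Lowest clickthrough first; among ties, most frequent query first.
    return sorted(low, key=lambda row: (row[1], -row[2]))

# Made-up sample log for illustration.
sample_log = [
    ("laptop", True), ("laptop", True), ("laptop", False),
    ("notebook", False), ("notebook", False),
    ("baby sitter", False), ("baby sitter", True),
    ("babysitter", True), ("babysitter", True),
]

flagged = low_clickthrough_queries(sample_log)
for query, rate, count in flagged:
    print(f"{query!r}: {rate:.0%} clickthrough over {count} searches")
```

The flagged queries are the ones to read by hand and sort into categories like those listed above. The code only produces the worklist.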
>> wunder
>> Walter Underwood
>>> On Apr 12, 2017, at 11:44 AM, David Kramer wrote:
>>> The idea is to not return poorly matching results, not to limit the number 
>>> of results returned.  One query may have hundreds of excellent matches and 
>>> another query may have 7. So cutting off by the number of results is 
>>> trivial but not useful.
>>> Again, we are not doing this for performance reasons. We’re doing this 
>>> because we don’t want to show products that are not very relevant to the 
>>> search terms specified by the user for UX reasons.
>>> I had hoped that the responses would have been more focused on “it can’t 
>>> be done” or “here’s how to do it” than “you don’t want to do it”. I’m 
>>> still left not knowing if it’s even possible. The one concrete answer of 
>>> using frange doesn’t help as referencing score in either the q or the fq 
>>> produces an “undefined field” error.
>>> Thanks.
>>> On 4/11/17, 8:59 AM, "Dorian Hoxha" wrote:
>>>    Can't the filter be used when you're paginating in a sharded scenario?
>>>    So if you do limit=10, offset=10, each shard will return 20 docs?
>>>    While if you do limit=10, _score<=last_page.min_score, then each shard
>>>    will return 10 docs? (They will still score all docs, but merging will be
>>>    faster.)
>>>    Makes sense?
>>>    On Tue, Apr 11, 2017 at 12:49 PM, alessandro.benedetti wrote:
>>>> Can I ask what the final requirement is here?
>>>> What are you trying to do?
>>>> - Just display fewer results?
>>>> You can easily do that in the search client, cutting off after a certain
>>>> number.
>>>> - Make search faster by returning fewer results?
>>>> This is not going to work, as you need to score all of them, as Erick
>>>> explained.
>>>> A function query (as Mikhail specified) will run on a per-document basis
>>>> (if I am correct), so if your idea was to speed things up, this is not
>>>> going to work.
>>>> It makes much more sense to refine your system to improve relevancy if your
>>>> concern is to have more relevant docs.
>>>> If your concern is just to not show that many pages, you can limit that
>>>> on the client side.
>>>> -----
>>>> ---------------
>>>> Alessandro Benedetti
>>>> Search Consultant, R&D Software Engineer, Director
>>>> Sease Ltd.
>>>> --
>>>> View this message in context: http://lucene.472066.n3.
>>>> tp4329180p4329295.html
>>>> Sent from the Solr - User mailing list archive at
