[ 
https://issues.apache.org/jira/browse/SOLR-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926877#comment-16926877
 ] 

Andy Hind commented on SOLR-13752:
----------------------------------

https://github.com/apache/lucene-solr/pull/871

> MoreLikeThis MLT is biased for uncommon fields
> ----------------------------------------------
>
>                 Key: SOLR-13752
>                 URL: https://issues.apache.org/jira/browse/SOLR-13752
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: MoreLikeThis
>            Reporter: Andy Hind
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> MLT always uses the total document count when scoring terms, not the count of 
> documents that have the specific field.
>  
> To quote Maria Mestre from the discussion on the mailing list - 29/01/19
>  
> {quote}The issue I have is that when retrieving the key scored terms 
> (interestingTerms), the code uses the total number of documents in the index, 
> not the total number of documents with populated “description” field. This is 
> where it’s done in the code: 
> [https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]
> The effect of this choice is that the “idf” does not vary much, given that 
> numDocs >> number of documents with “description”, so the key terms end up 
> being just the terms with the highest term frequencies.
> It is inconsistent because the MLT-search then uses these extracted key terms 
> and scores all documents using an idf which is computed only on the subset of 
> documents with “description”. So one part of the MLT uses a different numDocs 
> than another part. This sounds like an odd choice, and not expected at all, 
> and I wonder if I’m missing something.
> {quote}
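
To make the effect described in the quote concrete, here is a minimal sketch and not the MoreLikeThis code itself. It assumes terms are scored as tf * idf with an idf of the form 1 + ln((docCount + 1) / (docFreq + 1)), in the style of Lucene's ClassicSimilarity. The index sizes, term names, and frequencies are hypothetical, chosen only to show how the choice of document count can change which term ranks highest.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """idf of the form 1 + ln((docCount + 1) / (docFreq + 1))."""
    return 1.0 + math.log((doc_count + 1) / (doc_freq + 1))


# Hypothetical index: 1,000,000 documents, only 1,000 of which
# populate the "description" field.
TOTAL_DOCS = 1_000_000
DOCS_WITH_FIELD = 1_000

# Hypothetical terms from the source document's "description" field:
# term -> (term frequency in the source document, document frequency in the field)
TERMS = {
    "common": (3, 500),
    "rare": (1, 5),
}


def score_terms(doc_count: int) -> dict:
    """Score every term as tf * idf, using doc_count as the collection size."""
    return {
        term: tf * classic_idf(df, doc_count)
        for term, (tf, df) in TERMS.items()
    }


def top_term(doc_count: int) -> str:
    """Return the highest-scoring term for the given collection size."""
    scores = score_terms(doc_count)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for label, doc_count in (("total docs", TOTAL_DOCS),
                             ("docs with field", DOCS_WITH_FIELD)):
        print(label, score_terms(doc_count), "top:", top_term(doc_count))
```

With these made-up numbers, scoring against the total document count ranks the high-frequency term first. Scoring against only the documents that have the field ranks the rare term first, because the idf values then differ enough to outweigh the term frequency.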



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
