[jira] [Created] (SOLR-13752) MoreLikeThis MLT is biased for uncommon fields

Andy Hind (Jira) Tue, 10 Sep 2019 12:33:45 -0700

Andy Hind created SOLR-13752:
--------------------------------

             Summary: MoreLikeThis MLT is biased for uncommon fields
                 Key: SOLR-13752
                 URL: https://issues.apache.org/jira/browse/SOLR-13752
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: MoreLikeThis
            Reporter: Andy Hind



When computing idf, MLT always uses the total document count of the index and 
not the count of documents that contain the specific field.

 

To quote Maria Mestre from the mailing list discussion of 29/01/19:

 
{quote}The issue I have is that when retrieving the key scored terms 
(interestingTerms), the code uses the total number of documents in the index, 
not the total number of documents with populated “description” field. This is 
where it’s done in the code: 
[https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651]

The effect of this choice is that the “idf” does not vary much, given that 
numDocs >> number of documents with “description”, so the key terms end up 
being just the terms with the highest term frequencies.

It is inconsistent because the MLT-search then uses these extracted key terms 
and scores all documents using an idf which is computed only on the subset of 
documents with “description”. So one part of the MLT uses a different numDocs 
than another part. This sounds like an odd choice, and not expected at all, and 
I wonder if I’m missing something.
{quote}
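
The effect described in the quote can be shown with a small numeric sketch. It is not the MoreLikeThis code itself: the idf formula is modeled on Lucene's ClassicSimilarity, log((docCount + 1) / (docFreq + 1)) + 1, and the document counts, term names and term frequencies are invented for illustration only.

```python
import math


def idf(doc_freq: int, doc_count: int) -> float:
    # Same shape as Lucene's ClassicSimilarity idf (an assumption of this sketch):
    # log((docCount + 1) / (docFreq + 1)) + 1
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1.0


def rank_terms(terms: dict, doc_count: int) -> list:
    # Score each term as tf * idf and return the names, best first.
    scored = {name: tf * idf(df, doc_count) for name, (tf, df) in terms.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical index: one million documents, only 1,000 of which have "description".
TOTAL_DOCS = 1_000_000
DOCS_WITH_FIELD = 1_000

# Hypothetical terms from the source document's "description" field:
# name -> (term frequency in the source document, document frequency in the field)
terms = {
    "common": (5, 900),  # appears in most documents that have the field
    "rare": (3, 10),     # appears in very few of them
}

# idf based on the whole index: the values are all large and close together,
# so the higher term frequency decides the order.
print(rank_terms(terms, TOTAL_DOCS))

# idf based on documents that have the field: the rare term moves ahead.
print(rank_terms(terms, DOCS_WITH_FIELD))
```

With these made-up numbers the "common" term ranks first when the total document count is used, and the "rare" term ranks first when only documents with the field are counted. This is the inconsistency the quote describes.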



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
