[jira] Commented: (SOLR-1599) Improve IDF and relevance by separately indexing different entity types sharing a common schema

Graham Poulter (JIRA) Wed, 25 Nov 2009 04:29:05 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782388#action_12782388
 ]


Graham Poulter commented on SOLR-1599:
--------------------------------------

This is what can happen when multiple entity types are indexed in the same 
core, for instance artists and tracks, with a filter used to "search for 
artists".  Suppose you search for artists with two dismax terms _A_ or _B_ on 
the _name_ field.  Term _A_ is rare amongst artist _name_ values, so it should 
have a low docFreq and a relatively high weight compared to term _B_.  However, 
term _A_ happens to be common in track _name_ values, so its docFreq is higher, 
making the IDF weight for _A_ lower than it should be relative to term _B_.  
The filtered-out track documents are invisibly modifying the weight of query 
terms in a query for artists, which would not happen with separate indices 
(and thus separate docFreqs).
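
To make the effect concrete, here is a minimal sketch using the textbook 
IDF_t=log(N/DF_t) from the issue description (Lucene's actual Similarity 
applies a smoothed variant).  All document counts below are invented for 
illustration, and the names idf, artists and tracks are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """Textbook IDF_t = log(N / DF_t); not Lucene's exact formula."""
    return math.log(num_docs / doc_freq)

# Hypothetical corpus sizes and per-type document frequencies on the name field.
artists = {"N": 1_000, "df": {"A": 5, "B": 50}}
tracks = {"N": 100_000, "df": {"A": 20_000, "B": 100}}

# Separate artist index: only artist documents contribute to N and DF.
separate = {t: idf(artists["N"], artists["df"][t]) for t in ("A", "B")}

# Shared index: filtered-out tracks still count towards N and DF.
shared = {
    t: idf(artists["N"] + tracks["N"], artists["df"][t] + tracks["df"][t])
    for t in ("A", "B")
}

print("separate:", separate)  # A outweighs B, as expected for artists
print("shared:  ", shared)    # B outweighs A, because A is common in tracks
```

With these made-up numbers the relative weight of _A_ and _B_ flips between 
the two set-ups, even though the artist documents are identical in both.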

> Improve IDF and relevance by separately indexing different entity types 
> sharing a common schema
> -----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1599
>                 URL: https://issues.apache.org/jira/browse/SOLR-1599
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Graham Poulter
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the 
> documents in an index.  This introduces relevance problems when using a 
> single schema to store multiple entity types, for example to support "search 
> for tracks" and "search for artists".   The ranking for search on the _name_ 
> field of _track_ entities will be (much?) more accurate if the IDF for the 
> name field does not include counts from _artist_ entities.  The effect on 
> ranking would be most pronounced for query terms that have a low document 
> frequency for _track_ entities but a high frequency for _artist_ entities, or 
> vice versa.
>
> The current work-around to make the IDF entity-specific is to use a separate 
> Solr core for each entity type sharing the schema, copying solrconfig.xml and 
> schema.xml to every core.  Maintaining a core for _artists_ and a core for 
> _tracks_ on each node becomes more complicated with replication, and more so 
> with sharding.
>
> David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed 
> SOLR-1158, where he suggests calculating _numDocs_ after the application of 
> filters.  He recognises, however, that the document frequency (DF_t) for each 
> query term in a _track_ search would also need to exclude _artist_ entities 
> from the DF_t total to get the correct IDF_t=log(N/DF_t).   DF_t must be 
> calculated at index time, when Solr does not know what filters will be 
> applied.
>
> I suggest having a metadata field _entitytype_ specified on submitting a 
> batch of documents.  The schema would specify a list of allowed entity 
> types and a default entity type.  For example, a document could say either 
> entitytype="track" or entitytype="artist".  Each entity type has an 
> independent set of document frequencies, so the term "foo" will have a DF for 
> entitytype="artist" and a different DF for entitytype="track".   This might 
> be implemented by instantiating a separate Lucene index for each configured 
> entity type.  Filtering on entitytype="artist" would be implemented by 
> searching only the _artist_ index, analogous to searching only on the 
> _artist_ core in the multi-core workaround.
>
> With this solution (entity type metadata field implemented with separate 
> Lucene indices) a single Solr core can support many different entity types 
> that share a common schema but use partially overlapping subsets of fields, 
> instead of configuring, replicating and sharding a Solr core for every entity 
> type.
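
The per-entity-type bookkeeping proposed above could look roughly like the 
sketch below.  This is not Solr or Lucene code: the class EntityTypeStats, 
its methods and the whitespace tokenizer are all hypothetical, and a real 
implementation would keep these statistics inside separate Lucene indices 
rather than in Python dictionaries.

```python
import math
from collections import Counter, defaultdict

class EntityTypeStats:
    """Hypothetical per-entitytype document counts and document frequencies."""

    def __init__(self, allowed_types, default_type):
        self.allowed_types = set(allowed_types)
        self.default_type = default_type
        self.num_docs = Counter()             # entitytype -> N
        self.doc_freq = defaultdict(Counter)  # entitytype -> term -> DF_t

    def add_document(self, name, entitytype=None):
        """Index one document's name field under its entity type."""
        entitytype = entitytype or self.default_type
        if entitytype not in self.allowed_types:
            raise ValueError(f"unknown entitytype: {entitytype!r}")
        self.num_docs[entitytype] += 1
        # Count each term once per document (naive whitespace tokenizer).
        for term in set(name.lower().split()):
            self.doc_freq[entitytype][term] += 1

    def idf(self, term, entitytype):
        """IDF_t = log(N / DF_t), using only the given entity type's counts."""
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return 0.0
        return math.log(self.num_docs[entitytype] / df)

stats = EntityTypeStats(allowed_types=["artist", "track"], default_type="track")
for artist in ["Love", "Queen", "Prince", "Madonna"]:
    stats.add_document(artist, entitytype="artist")
for track in ["Love Me Do", "Love Story", "Endless Love", "Yesterday"]:
    stats.add_document(track)  # falls back to the default entity type

print(stats.idf("love", "artist"))  # DF comes from artist names only
print(stats.idf("love", "track"))   # DF comes from track names only
```

Because the frequencies are keyed by entity type at index time, no filter 
needs to be known in advance: a query restricted to entitytype="artist" 
simply reads the artist statistics.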

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
