[
https://issues.apache.org/jira/browse/SOLR-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Graham Poulter updated SOLR-1599:
---------------------------------
Description:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the
documents in an index. This introduces relevance problems when using a single
schema to store multiple entity types, for example to support "search for
tracks" and "search for artists". The ranking for search on the _name_ field
of _track_ entities will be (much?) more accurate if the IDF for the name field
does not include counts from _artist_ entities. The effect on ranking would be
most pronounced for query terms that have a low document frequency for _track_
entities but a high frequency for _artist_ entities, or vice versa.
The current work-around to make the IDF entity-specific is to use a separate
Solr core for each entity type, all sharing the same schema, and to copy
solrconfig.xml and schema.xml to every core. This becomes more complicated
with replication, and more so with sharding, because each node must maintain a
core for _artists_ and a core for _tracks_.
David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed
SOLR-1158, where he suggests calculating _numDocs_ after the application of
filters. He recognises, however, that the document frequency (DF_t) for each
query term in a _track_ search would also need to exclude _artist_ entities
from the DF_t total to get the correct IDF_t=log(N/DF_t). DF_t must be
calculated at index time, when Solr does not know what filters will be applied.
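To illustrate the size of the effect, here is a minimal sketch using the
simplified formula IDF_t=log(N/DF_t) quoted above. The corpus, the helper name
idf() and the term "love" are made up for illustration, and the formula is not
necessarily the exact one Lucene applies.

```python
import math

# Hypothetical toy corpus: (entity type, tokens of the name field).
DOCS = [
    ("artist", ["love"]),
    ("artist", ["love", "band"]),
    ("artist", ["love", "unlimited"]),
    ("artist", ["queen"]),
    ("track", ["love", "song"]),
    ("track", ["bohemian", "rhapsody"]),
    ("track", ["radio", "gaga"]),
    ("track", ["killer", "queen"]),
]


def idf(term, docs):
    """Simplified IDF_t = log(N / DF_t) over the given documents."""
    n = len(docs)
    df = sum(1 for _, tokens in docs if term in tokens)
    return math.log(n / df)


tracks = [doc for doc in DOCS if doc[0] == "track"]

idf_all = idf("love", DOCS)       # N = 8, DF = 4 -> log(2), about 0.69
idf_tracks = idf("love", tracks)  # N = 4, DF = 1 -> log(4), about 1.39
print(idf_all, idf_tracks)
```

In this toy corpus "love" is rare among tracks but common among artists, so the
index-wide IDF is half the track-only IDF and a track search would underweight
the term.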
I suggest having a metadata field _entitytype_ specified on submitting a batch
of documents. The schema would specify a list of allowed entity types and a
default entity type. For example, a document could say either entitytype="track"
or entitytype="artist". Each entity type has an independent set of
document frequencies, so the term "foo" will have a DF for entitytype="artist"
and a different DF for entitytype="track". This might be implemented by
instantiating a separate Lucene index for each configured entity type.
Filtering on entitytype="artist" would be implemented by searching only the
_artist_ index, analogous to searching only on the _artist_ core in the
multi-core workaround.
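The bookkeeping this proposal implies can be sketched as follows. This is only
an illustration of keeping independent document frequencies per entity type,
not a Solr or Lucene implementation: the class EntityTypeIndexes and its
methods are hypothetical names, and returning 0.0 for an unseen term is an
arbitrary choice for the sketch.

```python
import math
from collections import Counter, defaultdict


class EntityTypeIndexes:
    """Hypothetical sketch: independent statistics per entity type."""

    def __init__(self, allowed_types, default_type):
        if default_type not in allowed_types:
            raise ValueError("default type must be an allowed type")
        self.allowed_types = set(allowed_types)
        self.default_type = default_type
        self.num_docs = Counter()             # N per entity type
        self.doc_freq = defaultdict(Counter)  # DF_t per entity type

    def add(self, tokens, entitytype=None):
        """Index one document under its entity type (or the default)."""
        etype = entitytype or self.default_type
        if etype not in self.allowed_types:
            raise ValueError(f"unknown entity type: {etype}")
        self.num_docs[etype] += 1
        for term in set(tokens):  # count each term once per document
            self.doc_freq[etype][term] += 1

    def idf(self, term, entitytype):
        """Simplified IDF_t = log(N / DF_t) within one entity type only."""
        df = self.doc_freq[entitytype][term]
        if df == 0:
            return 0.0  # term unseen for this type; arbitrary for the sketch
        return math.log(self.num_docs[entitytype] / df)


idx = EntityTypeIndexes(["track", "artist"], default_type="track")
idx.add(["foo", "fighters"], entitytype="artist")
idx.add(["foo"], entitytype="artist")
idx.add(["foo", "song"])  # no entitytype given, so the default "track" applies
idx.add(["bar"], entitytype="track")

idf_artist = idx.idf("foo", "artist")  # N = 2, DF = 2 -> log(1) = 0.0
idf_track = idx.idf("foo", "track")    # N = 2, DF = 1 -> log(2)
```

Here "foo" appears in every _artist_ document but in only half of the _track_
documents, so the two entity types assign it different IDF values even though
they share one schema.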
With this solution (entity type metadata field implemented with separate Lucene
indices) a single Solr core can support many different entity types that share
a common schema but use partially overlapping subsets of fields, instead of
configuring, replicating and sharding a Solr core for every entity type.
was:
In Solr 1.4, the IDF (Inverse Document Frequency) is calculated on all of the
documents in an index. This introduces relevance problems when using a single
schema to store multiple entity types, for example to support "search for
tracks" and "search for artists". The ranking for search on the _name_ field
of _track_ entities will be (much?) more accurate if the IDF for the name field
does not include counts from _artist_ entities. The effect on ranking would be
most pronounced for query terms that have a low document frequency for _track_
entities but a high frequency for _artist_ entities, or visa versa.
The current work-around to make the IDF be entity-specific is to use a separate
Solr core for each entity type sharing the schema - and repeating the process
of copying solrconfig.xml and schema.xml to all the cores. This would be more
complicated with replication, and even more complicated with sharding, to
maintain a core for _artists_ and a core for _tracks_ on each node.
David Smiley, author of "Solr 1.4 Enterprise Search Server", has filed
SOLR-1158, where he suggests calculating _numDocs_ after the application of
filters. He recognises however that the document frequency (DF_t) for each
query term in a _track_ search would also needs to exclude _artist_ entities
from the DF_t total to get the correct IDF_t=log(N/DF_t). DF_t must be
calculated at index time, when Solr does not know what filters will be applied.
I suggest having a metadata field _entitytype_ specified on submitting a batch
of documents. The the schema would specify a list of allowed entity types and a
default entity type. For example, document could say either entitytype="track"
or entitytype="artist". Each each entity type has an independent set of
document frequencies, so the term "foo" will have a DF for entitytype="artist"
and a different DF for entitytype="track". This might be implemented by
instantiating a separate Lucene index for each configured entity type.
Filtering on entitytype="artist" would be implemented by searching only the
_artist_ index, analogous to searching only on the _artist_ core in the
multi-core workaround.
With this solution (entity type metadata field implemented with separate Lucene
indeces) a single Solr core can support many different entity types that share
a common schema but use partially overlapping subsets of fields, instead of
configureing, replicating and shardoeg a Solr core for every entity type.
> Improve IDF and relevance by separately indexing different entity types
> sharing a common schema
> -----------------------------------------------------------------------------------------------
>
> Key: SOLR-1599
> URL: https://issues.apache.org/jira/browse/SOLR-1599
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Graham Poulter
> Original Estimate: 504h
> Remaining Estimate: 504h
--
This message is automatically generated by JIRA.