[jira] [Commented] (STANBOL-1027) SolrYard should compensate for Solr lengthNorm on multi valued fields

Rupert Westenthaler (JIRA) Fri, 12 Apr 2013 01:09:19 -0700

    [ 
https://issues.apache.org/jira/browse/STANBOL-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629908#comment-13629908
 ]


Rupert Westenthaler commented on STANBOL-1027:
----------------------------------------------

Note that this will only apply to fields where boosts are enabled by the 
"org.apache.stanbol.entityhub.yard.solr.fieldBoosts" configuration property.

From the JavaDoc:

    /**
     * Key used to configure {@link Entry Entry&lt;String,Float&gt;} for fields with the boost.
     * If no Map is configured or a field is not present in the Map, then 1.0f is used as
     * Boost. If a Document boost is present, then the boost of a Field is
     * documentBoost*fieldBoost.
     */
    public static final String FIELD_BOOST_MAPPINGS =
            "org.apache.stanbol.entityhub.yard.solr.fieldBoosts";

For all other fields no index-time boost will be set at all. Here "field" 
refers to a field of the Representation, which may map to multiple fields in 
the Solr index (e.g. one per language of the values of that field).
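
As a minimal sketch of the lookup described by the JavaDoc above (the class, 
the method names and the "rdfs:label" field are hypothetical and only for 
illustration; this is not the actual SolrYard code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the JavaDoc; not the SolrYard implementation.
final class FieldBoostSketch {

    /** True only if the field has an entry in the configured boost map. */
    static boolean isBoostEnabled(Map<String, Float> fieldBoosts, String field) {
        return fieldBoosts != null && fieldBoosts.containsKey(field);
    }

    /** Field boost from the map; 1.0f if no map or no entry is present. */
    static float fieldBoost(Map<String, Float> fieldBoosts, String field) {
        if (fieldBoosts == null) {
            return 1.0f;
        }
        Float boost = fieldBoosts.get(field);
        return boost == null ? 1.0f : boost;
    }

    /** Combined boost as stated in the JavaDoc: documentBoost * fieldBoost. */
    static float effectiveBoost(Map<String, Float> fieldBoosts, String field,
                                float documentBoost) {
        return documentBoost * fieldBoost(fieldBoosts, field);
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("rdfs:label", 2.0f); // assumed example entry
        System.out.println(effectiveBoost(boosts, "rdfs:label", 3.0f));   // 6.0
        System.out.println(effectiveBoost(boosts, "rdfs:comment", 3.0f)); // 3.0
        System.out.println(isBoostEnabled(boosts, "rdfs:comment"));       // false
    }
}
```

In this sketch a compensation factor would only be multiplied in for fields 
where isBoostEnabled(..) returns true.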

                
> SolrYard should compensate for Solr lengthNorm on multi valued fields
> ---------------------------------------------------------------------
>
>                 Key: STANBOL-1027
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1027
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Entityhub
>    Affects Versions: entityhub-0.11.0
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Currently Entities with multiple labels are ranked lower for queries unless 
> omitNorms is enabled in the Solr schema.xml. The typical solution to this 
> problem is to enable omitNorms for such fields. However, this also means 
> that no index-time boosts can be applied to such fields.
> However for Entity lookups of the Stanbol Enhancer it is important to have 
> both:
> * index time boosts based on the ranking of the Entity within the knowledge 
> base (e.g. the number of incoming links in dbpedia)
> * compensation for the Solr lengthNorm in case of multiple labels
> This is especially important because very prominent Entities often have many 
> alternate labels defined and therefore suffer most from this.
> From http://lucene.472066.n3.nabble.com/QueryNorm-and-FieldNorm-td1992964.html:
> > 1. lengthNorm = measure of the importance of a term according to the total 
> > number of terms in the field
> >   1. Implementation: 1/sqrt(numTerms)
> >   2. Implication: a term matched in fields with less terms have a higher 
> > score
> >   3. Rationale: a term in a field with less terms is more important than 
> > one with more
> > 2. boost (index) = boost of the field at index-time
> >   1. Index time boost specified. The fieldNorm value in the score would 
> > include the same.
> > 3. boost (query) = boost of the field at query-time
> As an example we will use the Entity "Barack Obama" on freebase.com, which 
> has 20 alternate names for the English language. Together the 21 labels 
> (1 label + 20 alternate) contain 56 tokens, 19 of them unique. We will 
> compare it with another Entity, also named "Barack Obama", that does not 
> have any alternate labels.
> When indexing all those labels to a single (multivalued) text field we will 
> end up with a lengthNorm of 1/sqrt(56) = 0.134, while the 2nd Entity will 
> have a lengthNorm of 1/sqrt(2) = 0.707. This means that the 2nd Entity is 
> boosted relative to the first by a factor of ~5.3.
> To compensate for this, the suggestion is to apply an index-time boost of 
> sqrt(numLabels). Doing so would change the situation as follows: 
> Entity 1 would get a combined norm of sqrt(20)/sqrt(56) = 0.598 while the 
> 2nd Entity would stay at 0.707, resulting in a boost of the 2nd Entity by 
> about 18%.
> NOTE: based on my reading of the Lucene/Solr documentation, the fact that 
> the term "Obama" is contained 16 times in the labels of the first Entity 
> does not influence the scoring of term queries.
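
The arithmetic in the quoted description can be reproduced with a small 
sketch. The class and method names are hypothetical, and the sketch ignores any 
precision loss from how Lucene stores norms. It uses 20 as the number of labels 
to match the figures in the description; with all 21 labels the combined norm 
would be about 0.612 instead.

```java
import java.util.Locale;

// Hypothetical worked example of the proposed compensation; not SolrYard code.
final class LengthNormSketch {

    /** lengthNorm as quoted in the issue: 1/sqrt(numTerms). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** lengthNorm multiplied by the proposed index-time boost sqrt(numLabels). */
    static double compensatedNorm(int numLabels, int numTerms) {
        return Math.sqrt(numLabels) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        double many = lengthNorm(56);  // Entity 1: 56 tokens in all labels
        double single = lengthNorm(2); // Entity 2: one label, "Barack Obama"
        System.out.printf(Locale.ROOT, "%.3f%n", many);          // 0.134
        System.out.printf(Locale.ROOT, "%.3f%n", single);        // 0.707
        System.out.printf(Locale.ROOT, "%.2f%n", single / many); // 5.29

        double compensated = compensatedNorm(20, 56);
        System.out.printf(Locale.ROOT, "%.3f%n", compensated);          // 0.598
        System.out.printf(Locale.ROOT, "%.2f%n", single / compensated); // 1.18
        // The 2nd Entity has a single label, so its boost is sqrt(1) = 1.
        System.out.printf(Locale.ROOT, "%.3f%n", compensatedNorm(1, 2)); // 0.707
    }
}
```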

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
