[
https://issues.apache.org/jira/browse/STANBOL-1027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629908#comment-13629908
]
Rupert Westenthaler commented on STANBOL-1027:
----------------------------------------------
Note that this will only apply to fields where boosts are enabled by the
"org.apache.stanbol.entityhub.yard.solr.fieldBoosts" configuration property.
From the JavaDoc:
/**
 * Key used to configure {@link Entry Entry<String,Float>} for fields with the
 * boost. If no Map is configured or a field is not present in the Map, than
 * 1.0f is used as Boost. If a Document boost is present than the boost of a
 * Field is documentBoost*fieldBoost.
 */
public static final String FIELD_BOOST_MAPPINGS =
        "org.apache.stanbol.entityhub.yard.solr.fieldBoosts";
For all other fields no index time boost will be set at all. Here "field"
refers to a field of the Representation, which may actually map to multiple
fields in the Solr index (e.g. one per language of the values of the same
field).
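The rule stated in the JavaDoc can be sketched roughly as follows. This is
only an illustration: the class FieldBoostSketch, the method effectiveBoost
and the field name "rdfs:label" are hypothetical and are not taken from the
SolrYard code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating the JavaDoc rule; not part of SolrYard. */
public class FieldBoostSketch {

    /** Boost used when no Map is configured or the field has no entry. */
    static final float DEFAULT_BOOST = 1.0f;

    /**
     * Resolves the boost of a field: the configured fieldBoost (or 1.0f if
     * the field is not mapped), multiplied by the document boost if present.
     */
    static float effectiveBoost(Map<String, Float> fieldBoosts, String field,
                                Float documentBoost) {
        Float fieldBoost = fieldBoosts == null ? null : fieldBoosts.get(field);
        float boost = fieldBoost != null ? fieldBoost : DEFAULT_BOOST;
        return documentBoost != null ? documentBoost * boost : boost;
    }

    public static void main(String[] args) {
        Map<String, Float> fieldBoosts = new HashMap<>();
        fieldBoosts.put("rdfs:label", 2.0f); // example field name (assumption)
        // mapped field with a document boost: documentBoost*fieldBoost
        System.out.println(effectiveBoost(fieldBoosts, "rdfs:label", 1.5f));
        // unmapped field without a document boost: falls back to 1.0f
        System.out.println(effectiveBoost(fieldBoosts, "rdfs:comment", null));
    }
}
```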
> SolrYard should compensate for Solr lengthNorm on multi valued fields
> ---------------------------------------------------------------------
>
> Key: STANBOL-1027
> URL: https://issues.apache.org/jira/browse/STANBOL-1027
> Project: Stanbol
> Issue Type: Improvement
> Components: Entityhub
> Affects Versions: entityhub-0.11.0
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Currently Entities with multiple labels are down ranked for Queries unless
> omitNorms is enabled in the Solr schema.xml. Typically the solution to
> this problem is to enable omitNorms for such fields. However this also
> means that no index time boosts can be applied to such fields.
> However for Entity lookups of the Stanbol Enhancer it is important to have
> both:
> * index time boosts based on the ranking of the Entity within the knowledge
> base (e.g. the number of incoming links in dbpedia)
> * compensation for the Solr lengthNorm in case of multiple labels
> This is especially important as often very prominent Entities do have a lot
> of alternate Labels defined and therefore suffer most from this.
> from http://lucene.472066.n3.nabble.com/QueryNorm-and-FieldNorm-td1992964.html
> > 1. lengthNorm = measure of the importance of a term according to the total
> >    number of terms in the field
> >    1. Implementation: 1/sqrt(numTerms)
> >    2. Implication: a term matched in fields with less terms have a higher
> >       score
> >    3. Rationale: a term in a field with less terms is more important than
> >       one with more
> > 2. boost (index) = boost of the field at index-time
> >    1. Index time boost specified. The fieldNorm value in the score would
> >       include the same.
> > 3. boost (query) = boost of the field at query-time
> As an example we will use the Entity "Barack Obama", which on freebase.com
> has 20 alternate names for the English language. All 21 labels (1 label + 20
> alternate) sum up to 56 Tokens, 19 of which are unique. Further we
> will compare it with another Entity also named "Barack Obama" that does
> not have any alternate labels.
> When indexing all those labels to a single (multivalued) text field we will
> end up with a lengthNorm of 1/sqrt(56) = 0.134, while the 2nd Entity will
> have a lengthNorm of 1/sqrt(2) = 0.707. This means that the 2nd Entity is
> boosted relative to the first by a factor of ~5.3.
> To compensate for this, the suggestion is to apply an index time boost of
> sqrt(numLabels). Doing so would change the situation as follows:
> Entity 1 would get a combined norm of sqrt(20)/sqrt(56) = 0.598 while the 2nd
> Entity would stay at 0.707, resulting in a boost of the 2nd entity by about
> 20%.
> NOTE: based on my reading of the Lucene/Solr documentation, the fact
> that the term "Obama" is contained 16 times in the labels of the first Entity
> does not influence the scoring of term queries.
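The arithmetic of the example in the issue description can be reproduced with
a small sketch. The class LengthNormSketch and its methods are hypothetical
and are not SolrYard or Lucene code; they only evaluate 1/sqrt(numTerms) and
sqrt(numLabels)/sqrt(numTerms). Note that the description counts 21 labels
but uses sqrt(20) in its calculation; the sketch follows the quoted figures
and uses 20 (with 21 the combined norm would be about 0.612). As far as I
know Lucene also stores norms in a lossy single-byte encoding, so the values
seen in a real index may differ slightly.

```java
/** Hypothetical sketch of the arithmetic in the issue; not SolrYard code. */
public class LengthNormSketch {

    /** Length norm as quoted in the issue: 1/sqrt(numTerms). */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /** Proposed compensation: index time boost sqrt(numLabels) * lengthNorm. */
    static double compensatedNorm(int numLabels, int numTerms) {
        return Math.sqrt(numLabels) * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Entity 1: 56 tokens over all labels; Entity 2: one label, 2 tokens
        double norm1 = lengthNorm(56);
        double norm2 = lengthNorm(2);
        System.out.printf("without compensation: %.3f vs %.3f (ratio %.2f)%n",
                norm1, norm2, norm2 / norm1);

        // With the proposed boost; 20 is the value used in the issue text
        double comp1 = compensatedNorm(20, 56);
        double comp2 = compensatedNorm(1, 2);
        System.out.printf("with compensation:    %.3f vs %.3f (ratio %.2f)%n",
                comp1, comp2, comp2 / comp1);
    }
}
```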
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira