Rupert Westenthaler created STANBOL-1104:
--------------------------------------------

             Summary: Use Phrase queries for OR query terms in the SolrYard
                 Key: STANBOL-1104
                 URL: https://issues.apache.org/jira/browse/STANBOL-1104
             Project: Stanbol
          Issue Type: Improvement
          Components: Entityhub
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


Tests of EntityLinking against large vocabularies (e.g. Freebase with about 40 
million entities) have shown that the Solr queries currently used for 
multi-token OR queries do not always rank results as expected, for the 
following reasons:

ReferencedSites use entity rankings (implemented as index-time document 
boosts). Those rankings affect the ranking of query results. On the positive 
side, they ensure that a query for Paris returns Paris, France before Paris, 
Texas. On the negative side, for a query with two tokens (e.g. two given 
names), entities that match only one of those terms (e.g. a very famous person 
with one of the two requested given names) may be ranked before lower-ranked 
entities that match both terms.

This is even more likely for terms that are very common in the index, as 
normalization reduces the boost for entities with such a term, which gives 
the document boost an even higher impact.

The described behavior is especially a problem for the EntityLinkingEngine, as 
it uses exactly this kind of "{term1} OR {term2}" query to look up entities. 


Using "Term Proximity" as suggested by [1] is clearly the best option to 
work around the stated problem: (1) entities that match only one of the 
query terms get no boost from this part of the query; (2) even for entities 
that match several/all terms the ranking improves, because the distance 
between the terms within the text is considered when calculating the ranking.
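
As a rough illustration of the idea (not the actual SolrYard implementation), 
the sketch below builds a query string that keeps the plain OR clause and adds 
an optional sloppy phrase clause over the same terms. The class name, the 
field name "label", and the slop and boost values are all hypothetical; the 
real SolrYard field names and suitable values would need to be determined.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Stanbol: builds "{term1} OR {term2}"
// plus an optional sloppy phrase clause that only boosts documents
// containing all terms close to each other.
final class ProximityQuerySketch {

    // Characters with special meaning in the Lucene/Solr query syntax.
    private static final String SPECIAL = "+-&|!(){}[]^\"~*?:\\/";

    // Backslash-escape special characters and whitespace in a single term.
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // field, slop and boost are assumptions chosen for illustration only.
    static String build(String field, List<String> terms, int slop, float boost) {
        List<String> escaped = terms.stream()
                .map(ProximityQuerySketch::escape)
                .collect(Collectors.toList());
        // The plain OR clause: matches documents with at least one term.
        String orClause = field + ":(" + String.join(" OR ", escaped) + ")";
        if (terms.size() < 2) {
            return orClause; // a phrase clause makes no sense for one term
        }
        // The sloppy phrase clause: matches only if all terms occur within
        // 'slop' position moves of each other, and then adds to the score.
        String phrase = field + ":\"" + String.join(" ", escaped) + "\"~"
                + slop + "^" + boost;
        return orClause + " OR " + phrase;
    }
}
```

For example, build("label", Arrays.asList("john", "paul"), 10, 2.0f) yields 
label:(john OR paul) OR label:"john paul"~10^2.0. An entity matching only 
"john" still matches the first clause but gets nothing from the phrase clause.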

However, this also means that queries for multiple OR-connected terms will 
be more complex and need some additional time to process. The impact of this 
additional complexity needs to be investigated further.

Other possible workarounds:

* disable the use of index-time document boosts: however, this would have a 
negative impact on everyday searches (e.g. for Paris) and is therefore not an 
option in most scenarios.

* increase the number of selected entities for the EntityLinkingEngine: 
currently max(10,2*maxSuggestion) entities are retrieved. Increasing this value 
would make the engine more resistant to unexpected rankings. However, (1) it 
does not solve the problem but only works around it; (2) some tests have shown 
that even increasing the value to 50 does not include the expected result 
(using the freebase.com index as dataset).
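
To make the current limit concrete, here is a minimal sketch of the 
max(10,2*maxSuggestion) rule quoted above. The class and method names are 
hypothetical and do not come from the EntityLinkingEngine source.

```java
// Hypothetical helper mirroring the max(10, 2*maxSuggestion) rule.
final class LookupLimitSketch {

    // Number of entities requested from the index for a given
    // maximum number of suggestions.
    static int lookupLimit(int maxSuggestions) {
        return Math.max(10, 2 * maxSuggestions);
    }
}
```

With this rule, lookupLimit(3) is 10 and lookupLimit(25) is 50, so any 
configuration with up to five suggestions retrieves only ten entities.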


So if the performance overhead permits the use of phrase queries, they should 
be enabled for the Entityhub SolrYard. If the overhead turns out to be 
considerable, this should become a new option that can be 
activated/deactivated.


[1] http://wiki.apache.org/solr/SolrRelevancyCookbook#Term_Proximity




One solution would be to 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira