Rupert Westenthaler created STANBOL-1102:
--------------------------------------------

             Summary: EntityLinking MUST only accept single token matches for 
the currently active Token
                 Key: STANBOL-1102
                 URL: https://issues.apache.org/jira/browse/STANBOL-1102
             Project: Stanbol
          Issue Type: Bug
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)" 
configuration, the EntityLinking Engine supports OR queries against the 
controlled vocabulary that combine multiple linkable/matchable tokens 
(default=2). 
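
As a rough illustration of how such a search window could be built, here is a 
minimal Python sketch. The function names, the forward-looking window and the 
OR syntax are assumptions made for illustration; they are not taken from the 
Stanbol code base.

```python
def search_strings(tokens, active_index, max_search_tokens=2):
    """Return the active token plus the tokens that follow it, up to the limit.

    Assumption: the window only extends forward from the active token.
    """
    if max_search_tokens < 1:
        raise ValueError("max_search_tokens must be at least 1")
    return tokens[active_index:active_index + max_search_tokens]


def or_query(strings):
    """Join the search strings into one OR query (illustrative syntax only)."""
    return " OR ".join('"{}"'.format(s) for s in strings)


# Hypothetical list of linkable tokens; index 0 is the active token here.
linkable = ["FPÖ", "Bundesparteivorsitzenden", "Heinz", "Christian", "Strache"]
print(or_query(search_strings(linkable, 0)))
```

With a limit of 2, every query contains the active token and at most one 
neighbouring token, so a result may match either of them.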

This feature ensures that Entities matching longer sections of the text are 
ranked higher. This is especially important for bigger vocabularies and/or 
tokens that are common within the vocabulary, as the EntityLinking only 
considers the top 10 (or 3 * max suggestions) query results. 

However, in cases where no Entity matches several tokens of the search, this 
feature currently causes an unwanted side effect: a search may match single 
tokens that are not the currently active one. 

For example, the text section "Bei einer gmeinsamen Pressekonferenz mit 
FPÖ-Bundesparteivorsitzenden Heinz-Christian Strache in Langenlois" generates 
the following queries:

(1) process Token 5: FPÖ
  >> searchStrings [FPÖ, Bundesparteivorsitzenden]
  << 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.013vy8

(2) process Token 5: Bundesparteivorsitzenden
  >> searchStrings [Bundesparteivorsitzenden, Heinz]
  << 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.0c5y96

(3) process Token 7: Christian
  >> searchStrings [Christian, Strache]
  << 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3] 
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for 
http://rdf.freebase.com/ns/m.08lfdk

This results in a situation where "Heinz" is linked to another Entity, while 
"Heinz-Christian Strache", although it completely matches the text, is only 
linked with "Christian Strache" AND with a lower confidence!
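
The lower confidence in (3) is consistent with a label score computed as the 
number of matched label tokens divided by the total number of label tokens. 
Reading the `c=2(1.0)/3` and `l=` fields of the log this way is an assumption, 
and `label_score` is a hypothetical helper, not a Stanbol API:

```python
def label_score(matched_tokens, label_tokens):
    """Fraction of the label's tokens that were matched in the text."""
    if label_tokens < 1 or not 0 <= matched_tokens <= label_tokens:
        raise ValueError("expected 0 <= matched_tokens <= label_tokens, label_tokens >= 1")
    return matched_tokens / label_tokens


# "Christian Strache" covers 2 of the 3 tokens of "Heinz-Christian Strache".
partial = label_score(2, 3)
# Matching all three tokens would give the full score.
full = label_score(3, 3)
print(partial, full)
```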

The issue is that search (2), issued for the Token "Bundesparteivorsitzenden", 
MUST NOT suggest an Entity that does not match the currently active Token. 
Because it does so in the given example, "Heinz" is already consumed and 
can no longer be linked with the expected Entity mention "Heinz-Christian 
Strache".

This issue will add a rule to EntityLinking that the currently active Token 
needs to be included in suggestions. 
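
The proposed rule can be sketched as a filter over the labels returned by a 
search. This is a minimal Python illustration under stated assumptions: the 
tokenizer is a simple word split, matching is case-insensitive, and the names 
`tokenize`, `matches_active_token` and `filter_suggestions` are hypothetical. 
It is not the actual EntityLinking implementation.

```python
import re


def tokenize(label):
    """Split a label into lower-cased word tokens (simplified tokenizer)."""
    return [t.lower() for t in re.findall(r"\w+", label)]


def matches_active_token(label, active_token):
    """True if the active token is one of the label's tokens."""
    return active_token.lower() in tokenize(label)


def filter_suggestions(labels, active_token):
    """Keep only the labels that include the currently active token."""
    return [label for label in labels if matches_active_token(label, active_token)]


# Search (2): the active token is "Bundesparteivorsitzenden", so "Heinz" is dropped.
print(filter_suggestions(["Heinz"], "Bundesparteivorsitzenden"))
# A later search with "Heinz" as the active token may keep both labels.
print(filter_suggestions(["Heinz", "Heinz-Christian Strache"], "Heinz"))
```

Under this rule "Heinz" is not consumed while "Bundesparteivorsitzenden" is 
the active token, so it remains available for a later match against 
"Heinz-Christian Strache".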


