Rupert Westenthaler created STANBOL-1102:
--------------------------------------------
Summary: EntityLinking MUST only accept single token matches for
the currently active Token
Key: STANBOL-1102
URL: https://issues.apache.org/jira/browse/STANBOL-1102
Project: Stanbol
Issue Type: Bug
Reporter: Rupert Westenthaler
Assignee: Rupert Westenthaler
With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)"
configuration, the EntityLinking Engine supports OR queries that search the
controlled vocabulary for multiple linkable/matchable tokens at once (default=2).
This feature ensures that Entities matching longer sections of the text are
ranked higher. This is especially important for bigger vocabularies and/or
common tokens within the vocabulary, as the EntityLinking only considers the top
10 (or 3 * max suggestions) query results.
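As a rough illustration of how the search strings for one OR query could be
collected, the following sketch takes the currently active token plus the
tokens that follow it, up to the configured maximum. The class and method names
(SearchWindow, searchStrings) are hypothetical and not taken from the Stanbol
code base; the real engine also skips tokens that are not linkable/matchable
or already consumed, which is left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Stanbol: builds the OR query terms for one active token.
final class SearchWindow {

    /**
     * Returns the active token followed by the next tokens, at most
     * maxSearchTokens entries in total. Assumes the list already contains
     * only linkable/matchable tokens.
     */
    static List<String> searchStrings(List<String> tokens, int activeIndex, int maxSearchTokens) {
        List<String> result = new ArrayList<>();
        int end = Math.min(tokens.size(), activeIndex + maxSearchTokens);
        for (int i = activeIndex; i < end; i++) {
            result.add(tokens.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "Bundesparteivorsitzenden", "Heinz", "Christian", "Strache");
        // With maxSearchTokens=2 the active token is searched together with its successor.
        System.out.println(searchStrings(tokens, 0, 2));
    }
}
```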
However, in cases where no Entity matches several tokens of the search, this
feature currently causes an unwanted side effect: a query may match single
tokens other than the currently active one.
E.g. the text section "Bei einer gmeinsamen Pressekonferenz mit
FPÖ-Bundesparteivorsitzenden Heinz-Christian Strache in Langenlois" generates
the following queries:
(1) process Token 5: FPÖ
>> searchStrings [FPÖ, Bundesparteivorsitzenden]
<< 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.013vy8
(2) process Token 5: Bundesparteivorsitzenden
>> searchStrings [Bundesparteivorsitzenden, Heinz]
<< 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for
http://rdf.freebase.com/ns/m.0c5y96
(3) process Token 7: Christian
>> searchStrings [Christian, Strache]
<< 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3]
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for
http://rdf.freebase.com/ns/m.08lfdk
This results in a situation where "Heinz" is linked to another Entity, while
"Heinz-Christian Strache", although it completely matches the text, is only
linked via "Christian Strache" AND with a lower confidence!
The issue is that search (2), issued for the Token "Bundesparteivorsitzenden",
MUST NOT suggest an Entity that does not match the currently active Token.
Because it does so in the given example, "Heinz" is already consumed and
cannot be linked with the expected Entity mention "Heinz-Christian Strache".
This issue will add a rule to EntityLinking that the currently active Token
needs to be included in every accepted suggestion.
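A minimal sketch of such a rule, assuming each suggestion records the span of
token indexes it matched in the text. The types and names used here
(ActiveTokenFilter, Suggestion, accept) are hypothetical and only illustrate
the idea; they are not the actual Stanbol implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposed rule, not Stanbol code.
final class ActiveTokenFilter {

    /** A suggested entity together with the token span (inclusive) it matched. */
    static final class Suggestion {
        final String entity;
        final int start;
        final int end;

        Suggestion(String entity, int start, int end) {
            this.entity = entity;
            this.start = start;
            this.end = end;
        }

        /** True if the matched span includes the given token index. */
        boolean covers(int tokenIndex) {
            return start <= tokenIndex && tokenIndex <= end;
        }
    }

    /** Keeps only the suggestions whose matched span includes the active token. */
    static List<Suggestion> accept(List<Suggestion> candidates, int activeIndex) {
        List<Suggestion> accepted = new ArrayList<>();
        for (Suggestion s : candidates) {
            if (s.covers(activeIndex)) {
                accepted.add(s);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Token indexes assumed for this sketch:
        // 0=Bundesparteivorsitzenden, 1=Heinz, 2=Christian, 3=Strache
        List<Suggestion> candidates = Arrays.asList(
                new Suggestion("http://rdf.freebase.com/ns/m.0c5y96", 1, 1));
        // Active token is "Bundesparteivorsitzenden" (index 0): the match on
        // "Heinz" alone does not cover it, so nothing is accepted.
        System.out.println(accept(candidates, 0).size());
    }
}
```

With this filter, search (2) from the example would no longer consume "Heinz",
leaving it available when "Heinz" itself becomes the active Token.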