Rupert Westenthaler created STANBOL-1122:
--------------------------------------------

             Summary: Only Tokens with a fully linked entity should be marked as consumed
                 Key: STANBOL-1122
                 URL: https://issues.apache.org/jira/browse/STANBOL-1122
             Project: Stanbol
          Issue Type: Sub-task
          Components: Enhancement Engines
            Reporter: Rupert Westenthaler
            Assignee: Rupert Westenthaler


The EntityLinking process marks Tokens that are already linked with an Entity as 
"consumed". 

Let's assume a text mentions:

    "An airplane crashed in the northern part of the Democratic Republic of the Congo"

If Proper Noun linking is activated, "Democratic" would be the first 
"active" token within this sentence and ["Democratic", "Republic"] would be the 
first "search tokens". Now let's assume that the vocabulary contains the Entity 
"Democratic Republic of the Congo" and that it is returned by the 
EntitySearcher for a query for ["Democratic", "Republic"].

So when the Entity "Democratic Republic of the Congo" is matched with the 
sentence, all tokens up to and including "Congo" are marked as consumed. This 
ensures that there are no further lookups for "Republic" or "Congo".
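
The consumption step described above can be illustrated with a minimal sketch.
It is not the Stanbol implementation: the class and method names are
hypothetical, the active token positions are passed in by hand, and matching is
simplified to an exact comparison of the label tokens with the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT the Stanbol EntityLinking API.
final class ConsumedTokensSketch {

    /**
     * Returns the active tokens for which a lookup is actually performed.
     * Assumption: a single candidate label, matched exactly against the text.
     */
    static List<String> lookups(List<String> tokens, List<Integer> active,
                                List<String> label) {
        int consumed = -1; // index of the last consumed token
        List<String> performed = new ArrayList<>();
        for (int i : active) {
            if (i <= consumed) {
                continue; // token is already covered by a linked Entity
            }
            performed.add(tokens.get(i));
            int end = i + label.size();
            if (end <= tokens.size() && tokens.subList(i, end).equals(label)) {
                consumed = end - 1; // consume every token spanned by the label
            }
        }
        return performed;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("An", "airplane", "crashed", "in",
                "the", "northern", "part", "of", "the", "Democratic",
                "Republic", "of", "the", "Congo");
        // Assumed active (proper noun) tokens: Democratic, Republic, Congo
        List<Integer> active = Arrays.asList(9, 10, 13);
        List<String> label = Arrays.asList("Democratic", "Republic", "of",
                "the", "Congo");
        System.out.println(lookups(tokens, active, label)); // prints [Democratic]
    }
}
```

With the label matched at "Democratic", the tokens "Republic" and "Congo" are
skipped, so only one lookup is performed for this sentence.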

While this is generally good for suggested Entities that exactly match the text, 
it is dangerous for partial matches, as shown by the following example:

    "President Barack Obama said the US estimated ..."

If you link this text to Freebase, then "Presidency of Barack Obama" 
(https://www.freebase.com/m/05b6w1g) will get linked for the section "President 
Barack Obama". The match is "partial", as only three of the four tokens of the 
label match the text, and the inexact match of "Presidency" with "President" 
further reduces the confidence to an overall score of about 0.6.

However, the current algorithm would still mark "Barack" and "Obama" as consumed 
and therefore prevent "Barack Obama" from being linked for this mention.

This issue will change this so that only FULL matches (where all tokens 
in the label match tokens in the text) will mark Tokens in the text as 
consumed. 
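
The proposed rule could be sketched roughly as follows. This is only an
illustration under stated assumptions: the class, enum and method names are
hypothetical and not taken from the Stanbol code base, and the match is reduced
to a count of matched label tokens.

```java
// Hypothetical sketch, NOT the Stanbol EntityLinking API.
final class FullMatchSketch {

    enum Match { NONE, PARTIAL, FULL }

    /** Classifies a suggestion by how many tokens of its label matched the text. */
    static Match classify(int matchedLabelTokens, int labelTokens) {
        if (matchedLabelTokens <= 0) {
            return Match.NONE;
        }
        return matchedLabelTokens >= labelTokens ? Match.FULL : Match.PARTIAL;
    }

    /**
     * Returns the new index of the last consumed token.
     * Only FULL matches advance it; PARTIAL matches leave the covered tokens
     * available for further lookups.
     */
    static int consumedUpTo(int currentConsumed, int matchEnd, Match match) {
        return match == Match.FULL
                ? Math.max(currentConsumed, matchEnd)
                : currentConsumed;
    }

    public static void main(String[] args) {
        // "Presidency of Barack Obama": three of the four label tokens match
        Match obama = classify(3, 4);
        System.out.println(obama + " -> " + consumedUpTo(-1, 2, obama));
        // "Democratic Republic of the Congo": all five label tokens match
        Match congo = classify(5, 5);
        System.out.println(congo + " -> " + consumedUpTo(-1, 13, congo));
    }
}
```

Under this rule "Barack" and "Obama" stay unconsumed after the partial match
with "Presidency of Barack Obama", so a later lookup can still link "Barack
Obama" for the same mention.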
