[ 
https://issues.apache.org/jira/browse/STANBOL-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-1102:
-----------------------------------------

    Description: 
With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)" 
configuration, the EntityLinking Engine supports OR queries over multiple 
linkable/matchable tokens against the controlled vocabulary (default=2). 

This feature ensures that Entities matching longer sections of the text are 
ranked higher. This is especially important for bigger vocabularies and/or 
common tokens within the vocabulary, as the EntityLinking only considers the 
top 10 (or 3 * max suggestions) query results. 

However, when multiple Tokens are used for a search, there might be 
suggestions that match some tokens in the Text, but not the currently active 
one. Currently those suggestions are taken into account, which can cause 
unwanted states like the one described in the following example:

    "Bei einer gmeinsamen Pressekonferenz mit FPÖ-Bundesparteivorsitzenden 
Heinz-Christian Strache in Langenlois" 

This generates the following queries:

(1) process Token 5: FPÖ
  >> searchStrings [FPÖ, Bundesparteivorsitzenden]
  << 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.013vy8

(2) process Token 5: Bundesparteivorsitzenden
  >> searchStrings [Bundesparteivorsitzenden, Heinz]
 << 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.0c5y96

(3) process Token 7: Christian
  >> searchStrings [Christian, Strache]
 << 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3] 
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for 
http://rdf.freebase.com/ns/m.08lfdk

This results in a situation where "Heinz" is linked to another Entity, while 
"Heinz-Christian Strache", although it completely matches the text, is only 
linked via "Christian Strache" AND with a lower confidence!
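
As a rough illustration of why result (3) scores lower: the log entry 
"c=2(1.0)/3" is consistent with two of the three label tokens having been 
matched. The sketch below assumes (this is not stated in the issue) that "l" 
is the fraction of label tokens matched, that "t" is the fraction of the 
searched text span matched, and that the score is their product. The class and 
method names are hypothetical and are not taken from the Stanbol code base.

```java
/** Hypothetical sketch of the score arithmetic; not the Stanbol implementation. */
public class LabelScoreSketch {

    /** Fraction of tokens matched, e.g. 2 of 3 label tokens. */
    public static double coverage(int matchedTokens, int totalTokens) {
        if (totalTokens <= 0 || matchedTokens < 0 || matchedTokens > totalTokens) {
            throw new IllegalArgumentException("invalid token counts");
        }
        return (double) matchedTokens / totalTokens;
    }

    /** Assumed combination: label coverage multiplied by text coverage. */
    public static double score(double labelCoverage, double textCoverage) {
        return labelCoverage * textCoverage;
    }

    public static void main(String[] args) {
        // Result (3): only "Christian Strache" matched of "Heinz-Christian Strache"
        double partial = score(coverage(2, 3), 1.0);
        // Expected case: all three label tokens matched
        double full = score(coverage(3, 3), 1.0);
        System.out.println("partial=" + partial + " full=" + full);
    }
}
```

Under these assumptions, a label that also matched "Heinz" would reach a score 
of 1.0 instead of roughly 0.67.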

The issue is that search (2), issued for the Token "Bundesparteivorsitzenden", 
MUST NOT suggest an Entity that does not match the currently active Token. 
Because it does so in the given example, "Heinz" is already consumed and 
cannot be linked with the expected Entity mention "Heinz-Christian Strache".

This issue will add a rule to the Label <-> Text matching that a Label MUST 
match the currently active token in the text.
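
The rule described above can be sketched as a simple filter. This is only a 
minimal illustration, not the actual change: the class ActiveTokenFilter, the 
Suggestion type and the token indexes are hypothetical, and it assumes that 
each suggestion records the inclusive token span its label matched in the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the Stanbol EntityLinking implementation. */
public class ActiveTokenFilter {

    /** A label match covering the token span [start, end] (inclusive) of the text. */
    public static final class Suggestion {
        public final String label;
        public final int start;
        public final int end;

        public Suggestion(String label, int start, int end) {
            this.label = label;
            this.start = start;
            this.end = end;
        }

        /** True if the matched span includes the given token index. */
        public boolean covers(int tokenIndex) {
            return start <= tokenIndex && tokenIndex <= end;
        }
    }

    /** Keeps only the suggestions whose matched span includes the active token. */
    public static List<Suggestion> acceptForActiveToken(List<Suggestion> candidates,
                                                        int activeToken) {
        List<Suggestion> accepted = new ArrayList<>();
        for (Suggestion s : candidates) {
            if (s.covers(activeToken)) {
                accepted.add(s);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Illustrative token indexes only: 6 = Bundesparteivorsitzenden, 7 = Heinz
        List<Suggestion> candidates = new ArrayList<>();
        candidates.add(new Suggestion("Heinz", 7, 7));
        // Active token 6: "Heinz" does not cover it, so it is rejected (prints 0)
        System.out.println(acceptForActiveToken(candidates, 6).size());
        // Active token 7: "Heinz" covers it, so it is accepted (prints 1)
        System.out.println(acceptForActiveToken(candidates, 7).size());
    }
}
```

With such a filter, search (2) would no longer consume "Heinz" while 
"Bundesparteivorsitzenden" is the active token, leaving "Heinz" available for 
the later match of "Heinz-Christian Strache".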


  was:
With the "Max Search Tokens (enhancer.engines.linking.maxSearchTokens)" 
configuration, the EntityLinking Engine supports OR queries over multiple 
linkable/matchable tokens against the controlled vocabulary (default=2). 

This feature ensures that Entities matching longer sections of the text are 
ranked higher. This is especially important for bigger vocabularies and/or 
common tokens within the vocabulary, as the EntityLinking only considers the 
top 10 (or 3 * max suggestions) query results. 

However, in cases where no Entity matches several tokens of the search, this 
feature currently causes an unwanted side effect: it may match single tokens 
that are not the currently active one. 

E.g. the text section "Bei einer gmeinsamen Pressekonferenz mit 
FPÖ-Bundesparteivorsitzenden Heinz-Christian Strache in Langenlois" generates 
the following queries:

(1) process Token 5: FPÖ
  >> searchStrings [FPÖ, Bundesparteivorsitzenden]
  << 0: FPÖ[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.013vy8

(2) process Token 5: Bundesparteivorsitzenden
  >> searchStrings [Bundesparteivorsitzenden, Heinz]
 << 0: Heinz[m=FULL,s=1,c=1(1.0)/1] score=1.0[l=1.0,t=1.0] for 
http://rdf.freebase.com/ns/m.0c5y96

(3) process Token 7: Christian
  >> searchStrings [Christian, Strache]
 << 0: Heinz-Christian Strache[m=FULL,s=2,c=2(1.0)/3] 
score=0.6666666666666666[l=0.6666666666666666,t=1.0] for 
http://rdf.freebase.com/ns/m.08lfdk

This results in a situation where "Heinz" is linked to another Entity, while 
"Heinz-Christian Strache", although it completely matches the text, is only 
linked via "Christian Strache" AND with a lower confidence!

The issue is that search (2), issued for the Token "Bundesparteivorsitzenden", 
MUST NOT suggest an Entity that does not match the currently active Token. 
Because it does so in the given example, "Heinz" is already consumed and 
cannot be linked with the expected Entity mention "Heinz-Christian Strache".

This issue will add a rule to EntityLinking that the currently active Token 
needs to be included in suggestions. 


    
> EntityLinking MUST only accept Suggestions for the current active Token
> -----------------------------------------------------------------------
>
>                 Key: STANBOL-1102
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1102
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
