[
https://issues.apache.org/jira/browse/STANBOL-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler resolved STANBOL-1262.
------------------------------------------
Resolution: Fixed
Fix Version/s: 0.12.0
implemented with http://svn.apache.org/r1560281 in 0.12 and merged to trunk
with http://svn.apache.org/r1560286
> Change/Improve processing of Chunks by EntityLinking
> -----------------------------------------------------
>
> Key: STANBOL-1262
> URL: https://issues.apache.org/jira/browse/STANBOL-1262
> Project: Stanbol
> Issue Type: Improvement
> Affects Versions: 0.12.0
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
> Fix For: 0.12.0
>
>
> The first step of EntityLinking (applies to all EntityLinkingEngines incl.
> the Lucene FST Linking Engine) is that it classifies Tokens as "linkable",
> "matchable" and "others". In addition, it determines the "processible" chunks
> that Tokens are contained in.
> This issue is about changing how "processible" chunks are determined
> if the AnalyzedText contains multiple overlapping chunks.
> A typical case where this can happen is if both a Noun Phrase Detection and a
> Named Entity Recognition are contained in the Chain. The chunks selected by
> Named Entities will typically be smaller than the corresponding Noun Phrase.
> There are even situations where the Named Entity does not even include all
> Nouns contained in a Noun Phrase.
> Here is an example taken from [1]:
> After a disappointing start against an Everton side who led through Kevin
> Mirallas's first-half goal ...
> While "Everton" is detected as Organization by NER, the Noun Phrase "an
> Everton side" also includes 'side' as a 2nd noun. Therefore 'Everton' is not
> considered for linking, as it only matches 1 of 2 matchable tokens within a
> 'processible phrase'.
> This is because EntityLinking currently merges overlapping processible phrases
> together, a semantic that is no longer optimal for EntityLinking.
> To avoid recall problems like the one described, the last Chunk emitted by the
> AnalyzedText should be used instead. For the above example this would result
> in
> - an [other]: an Everton side
> - Everton [linkable]: Everton
> - side [matchable]: an Everton side
> So 'Everton' would get correctly linked to an Entity with the label Everton
> but 'side' would not get linked to an Entity with the label Side, as it is in
> a Phrase with another linkable/matchable token.
> Another example would be ' ... the University of Munich is ... ' where one
> could expect Noun Phrases for 'the University' and 'Munich' (if single token
> noun phrases are emitted by the chunker component). In addition, as a result
> of the NER engine, one can expect a chunk for 'University of Munich'.
> - the [other]: the University
> - University [matchable]: University of Munich
> - of [other]: University of Munich
> - Munich [linkable]: Munich
> This would result in the linking rules that 'University' is only linked to
> Entities that also match Munich in their Label, while Munich would also be
> linked to Entities that just include Munich. This is a small difference from
> the current implementation, where Munich alone would not get linked as all the
> chunks would get merged into one big chunk covering 'the University of Munich'.
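
The difference between merging overlapping chunks and using the last emitted chunk can be sketched as follows. This is only an illustration, not the Stanbol implementation: the function names are hypothetical, chunks are modelled as token-index ranges, and the emit order (by start, enclosing chunks before the chunks they contain) is an assumption chosen because it reproduces both listings above.

```python
from typing import List, Optional, Tuple

# A chunk is a (start, end) pair of token indices, end exclusive.
Chunk = Tuple[int, int]


def merge_overlapping(chunks: List[Chunk]) -> List[Chunk]:
    """Previous behaviour: overlapping chunks collapse into one big chunk."""
    merged: List[Chunk] = []
    for start, end in sorted(chunks):
        if merged and start < merged[-1][1]:
            # Overlaps the previous chunk: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def last_emitted_chunk(token: int, chunks: List[Chunk]) -> Optional[Chunk]:
    """Proposed behaviour: the last emitted chunk covering the token wins."""
    # Assumed emit order: by start, longer (enclosing) chunks first.
    ordered = sorted(chunks, key=lambda c: (c[0], -c[1]))
    covering = [c for c in ordered if c[0] <= token < c[1]]
    return covering[-1] if covering else None


def phrase(tokens: List[str], chunk: Chunk) -> str:
    """Render a chunk as the text of the tokens it spans."""
    return " ".join(tokens[chunk[0]:chunk[1]])


tokens = ["the", "University", "of", "Munich"]
# Noun phrases 'the University' and 'Munich', named entity 'University of Munich'.
chunks = [(0, 2), (3, 4), (1, 4)]

old = [phrase(tokens, c) for c in merge_overlapping(chunks)]
new = [phrase(tokens, last_emitted_chunk(i, chunks)) for i in range(len(tokens))]
print(old)  # ['the University of Munich']
print(new)  # ['the University', 'University of Munich', 'University of Munich', 'Munich']
```

With the tokens "an", "Everton", "side" and the chunks (0, 3) and (1, 2), the same function assigns 'Everton' to the chunk "Everton" and the other two tokens to "an Everton side".
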
> [1]
> http://www.theguardian.com/football/2014/jan/20/west-bromwich-albion-everton-premier-league-match-report
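
The linking consequences described in the issue ('Everton' links, 'side' does not, 'University' needs 'Munich' in the label) can be illustrated with a simple ratio check. This is a rough sketch under assumptions: `may_link` and `min_ratio` are hypothetical names, and the rule "more than half of the linkable/matchable tokens of the chunk must appear in the label" is inferred from the "1 of 2 matchable tokens" remark, not taken from the Stanbol code.

```python
from typing import List, Set


def may_link(label: str, chunk_tokens: List[str], matchable: Set[str],
             min_ratio: float = 0.5) -> bool:
    """Accept a label only if it covers enough of the chunk's matchable tokens.

    chunk_tokens: tokens of the processible chunk assigned to the token
    matchable:    tokens classified as linkable or matchable
    min_ratio:    hypothetical threshold; the matched ratio must exceed it
    """
    label_words = {word.lower() for word in label.split()}
    relevant = [t for t in chunk_tokens if t in matchable]
    if not relevant:
        return False
    matched = sum(1 for t in relevant if t.lower() in label_words)
    return matched / len(relevant) > min_ratio


nouns = {"Everton", "side"}
# Merged chunk: 'Everton' covers only 1 of 2 matchable tokens.
print(may_link("Everton", ["an", "Everton", "side"], nouns))  # False
# Last emitted chunk for 'Everton' is the named entity itself.
print(may_link("Everton", ["Everton"], nouns))                # True
# 'side' stays in the noun phrase, so a label 'Side' is rejected.
print(may_link("Side", ["an", "Everton", "side"], nouns))     # False
```

Applied to the second example, a label "University" is rejected for the chunk "University of Munich", while "University of Munich" is accepted, and "Munich" is accepted for the single-token chunk "Munich".
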