[
https://issues.apache.org/jira/browse/STANBOL-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rupert Westenthaler updated STANBOL-1049:
-----------------------------------------
Fix Version/s: 0.12.0
> Add support for Upper Case Linking for Languages without NLP support
> --------------------------------------------------------------------
>
> Key: STANBOL-1049
> URL: https://issues.apache.org/jira/browse/STANBOL-1049
> Project: Stanbol
> Issue Type: Improvement
> Components: Enhancement Engines
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
> Fix For: 0.12.0
>
>
> This issue allows the EntityLinkingEngine to use upper case token
> information when linking languages without NLP support.
> If TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled
> AND the language of the processed text uses a bicameral script (an alphabet
> with upper case letters), only upper case tokens that are equal to or longer
> than TextProcessingConfig#minSearchTokenLength are marked as 'linkable'.
> This avoids vocabulary lookups for lower case Tokens and therefore improves
> performance when processing languages without POS tagging support.
> ---
> Definitions:
> -------
> The EntityLinking Engine distinguishes three Token Types [1]:
> * Linkable Token: A Word that triggers a lookup in the Controlled Vocabulary
> * Matchable Token: A Word that is used to search and match Entities, but does
> not trigger a lookup
> * Other Tokens: Not used for search and matching. Might be used for fine
> tuning confidence values.
> Language level information includes
> * isUnicaseScript [true, false]: If the processed language uses a unicase
> script, i.e. a script without upper case letters
> Token level information includes
> * hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
> * hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
> * isUpperCase [true,false]: If the first letter is an upper case one
> * hasAlphaNumeric [true,false]: If the word has an alpha numeric char
> * hasSearchableLength [true,false]: If the word is at least as long as the
> configured "Min Search Token Length"
> * isSubSentenceStart [true, false]: If the POS tag of a Token is Pos#Quote.
> Algorithm:
> ------
> This describes the algorithm used to classify Tokens as linkable, matchable
> and other based on the above properties. Rules are applied in the given
> order. A summary of the result for Tokens with no POS tags is given in the
> section "Languages without NLP support".
> __1. Basic rules:__
> * all Tokens without an AlphaNumeric character are neither linkable nor
> matchable
> * all tokens with hasLinkablePos are linkable
> * all linkable tokens and tokens with hasMatchablePOS are matchable
> __2. Uppercase Processing Rules__
> These rules are applied to UpperCase tokens that are not at a sentence or
> subSentence start
> * if TextProcessingConfig#LinkUpperCaseTokens is enabled
>   * all tokens with hasMatchablePOS == true are also marked as linkable
>   * all tokens with hasMatchablePOS == false are marked as matchable
> * if TextProcessingConfig#MatchUpperCaseTokens is enabled
>   * all tokens with hasMatchablePOS == false are marked as matchable
> __3. Unknown POS tag Rules__
> These rules only apply to Tokens that have AlphaNumeric characters and
> where both hasLinkablePos == null and hasMatchablePos == null
> * if the processed language uses a unicase script or
>   TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is disabled
>   * all tokens equal to or longer than
>     TextProcessingConfig#minSearchTokenLength are marked as linkable
> * else (bicameral script and
>   TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled)
>   * if UpperCase token and not at a sentence or sub-sentence start
>     * tokens equal to or longer than
>       TextProcessingConfig#minSearchTokenLength are marked as linkable
>     * tokens shorter than TextProcessingConfig#minSearchTokenLength are
>       marked as matchable
>   * else (lower case token, or sentence or sub-sentence start)
>     * tokens equal to or longer than
>       TextProcessingConfig#minSearchTokenLength are marked as matchable
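The "Unknown POS tag Rules" above can be sketched roughly as follows. This is only an illustration of the decision order described in the issue, not the actual Stanbol implementation: the names TokenRole, UnknownPosConfig and UnknownPosRule are hypothetical, and the handling of cases the description leaves open (tokens shorter than the minimum length in the first and last branch) is an assumption that treats them as "other".

```java
/** Hypothetical result categories; the names are illustrative, not Stanbol's API. */
enum TokenRole { LINKABLE, MATCHABLE, OTHER }

/** Hypothetical stand-in for the two TextProcessingConfig options used by rule 3. */
final class UnknownPosConfig {
    final boolean linkOnlyUpperCaseTokensWithMissingPosTag;
    final int minSearchTokenLength;

    UnknownPosConfig(boolean linkOnlyUpperCaseTokensWithMissingPosTag,
                     int minSearchTokenLength) {
        this.linkOnlyUpperCaseTokensWithMissingPosTag = linkOnlyUpperCaseTokensWithMissingPosTag;
        this.minSearchTokenLength = minSearchTokenLength;
    }
}

/** Sketch of rule 3 for tokens where both POS flags are null (no POS tag). */
final class UnknownPosRule {

    static TokenRole classify(String token,
                              boolean unicaseScript,
                              boolean sentenceOrSubSentenceStart,
                              UnknownPosConfig cfg) {
        // Basic rule: tokens without an alphanumeric character are never used.
        boolean hasAlphaNumeric = token.codePoints().anyMatch(Character::isLetterOrDigit);
        if (!hasAlphaNumeric) {
            return TokenRole.OTHER;
        }
        boolean searchableLength =
                token.codePointCount(0, token.length()) >= cfg.minSearchTokenLength;

        // Unicase script, or the upper case restriction is disabled:
        // every sufficiently long token is linkable.
        if (unicaseScript || !cfg.linkOnlyUpperCaseTokensWithMissingPosTag) {
            // Assumption: shorter tokens are not mentioned in the issue; treated as OTHER.
            return searchableLength ? TokenRole.LINKABLE : TokenRole.OTHER;
        }

        // Bicameral script with the restriction enabled.
        boolean upperCase = Character.isUpperCase(token.codePointAt(0));
        if (upperCase && !sentenceOrSubSentenceStart) {
            return searchableLength ? TokenRole.LINKABLE : TokenRole.MATCHABLE;
        }
        // Lower case token, or a token at a sentence or sub-sentence start.
        // Assumption: shorter tokens are not mentioned in the issue; treated as OTHER.
        return searchableLength ? TokenRole.MATCHABLE : TokenRole.OTHER;
    }
}
```

With the restriction enabled and a minimum length of 3, a mid-sentence "Paris" would be linkable under this sketch, while "city" and a sentence-initial "The" would only be matchable, so neither triggers a vocabulary lookup on its own.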
> Languages without NLP support
> -----
> For languages without NLP processing support, meaning that no POS tagging is
> available, the following configurations are important:
> * linkOnlyUpperCaseTokensWithMissingPosTag: This indicates that only upper
> case Tokens should be considered for linking. Note that this option is
> ignored for languages with a unicase script - scripts that do not use upper
> case characters.
> * minSearchTokenLength: This indicates that only words with at least the
> configured number of characters should be considered for linking
> By default the 'linkOnlyUpperCaseTokensWithMissingPosTag' has the same value
> as the 'properNounsState' configuration. This means that if the "link only
> proper nouns" option is enabled only upper case tokens will be linked for
> languages without POS support. The default for the minSearchTokenLength is 3
> letters.
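The default behaviour described above could be expressed along these lines. This is a minimal sketch under stated assumptions: MissingPosDefaults and resolveLinkOnlyUpperCase are hypothetical names, and modelling "not explicitly configured" as a null Boolean is an assumption, not something the issue specifies.

```java
/** Hypothetical helper showing how the described defaults could be resolved. */
final class MissingPosDefaults {

    /** Default minimum token length stated in the issue (3 letters). */
    static final int DEFAULT_MIN_SEARCH_TOKEN_LENGTH = 3;

    /**
     * Returns the explicit setting when one is given; otherwise falls back to
     * the "link only proper nouns" state, as the issue describes.
     * Assumption: a null value stands for "not explicitly configured".
     */
    static boolean resolveLinkOnlyUpperCase(Boolean explicitValue,
                                            boolean linkOnlyProperNouns) {
        return explicitValue != null ? explicitValue : linkOnlyProperNouns;
    }
}
```

Under this sketch, leaving the option unset while "link only proper nouns" is enabled restricts linking to upper case tokens for languages without POS support.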
> [1]
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types