[jira] [Updated] (STANBOL-1049) Add support for Upper Case Linking for Languages without NLP support

Rupert Westenthaler (JIRA) Wed, 16 Oct 2013 22:09:32 -0700

     [ 
https://issues.apache.org/jira/browse/STANBOL-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rupert Westenthaler updated STANBOL-1049:
-----------------------------------------

    Fix Version/s: 0.12.0

> Add support for Upper Case Linking for Languages without NLP support
> --------------------------------------------------------------------
>
>                 Key: STANBOL-1049
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1049
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancement Engines
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>             Fix For: 0.12.0
>
>
> This issue will allow the EntityLinkingEngine to use upper case token 
> information when linking languages without NLP support. 
> If TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled 
> AND the language of the processed text uses a bicameral script (an alphabet 
> with upper case letters), only upper case tokens that are equal to or longer 
> than TextProcessingConfig#minSearchTokenLength will be marked as 'linkable'. 
> This avoids vocabulary lookups for lower case Tokens and therefore 
> dramatically improves performance when processing languages without POS 
> tagging support.
> ---
> Definitions:
> -------
> The EntityLinking Engine distinguishes three (Token Types)[1]:
> * Linkable Token: A Word that triggers a lookup in the Controlled Vocabulary
> * Matchable Token: A Word that is used to search and match Entities, but does 
> not trigger a lookup
> * Other Tokens: Not used for search and matching. Might be used for fine 
> tuning confidence values.
> Language level information includes
> * isUnicaseScript [true, false]: If the processed language uses a unicase 
> script - one that has no upper case letters
> Token level information includes
> * hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
> * hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
> * isUpperCase [true,false]: If the first letter is an upper case one
> * hasAlphaNumeric [true,false]: If the word has an alphanumeric char
> * hasSearchableLength [true,false]: If the word is equal to or longer than 
> the configured "Min Search Token Length"
> * isSubSentenceStart [true, false]: If the POS tag of a Token is Pos#Quote.
> Algorithm:
> ------
> This describes the algorithm used to classify Tokens as linkable, matchable 
> and other based on the above properties. Rules are applied in the given 
> order. A summary of the result for Tokens with no POS tags is given in the 
> next section 
> __1. Basic rules:__
> * all Tokens without an AlphaNumeric character are neither linkable nor 
> matchable
> * all tokens with hasLinkablePos are linkable
> * all linkable tokens and tokens with hasMatchablePOS are matchable
> __2. Uppercase Processing Rules__
> These rules are applied to UpperCase tokens that are not at a sentence or 
> subSentence start 
> * if TextProcessingConfig#LinkUpperCaseTokens is enabled
>     * all tokens with hasMatchablePOS == true are also marked as linkable
>     * all tokens with hasMatchablePOS == false are marked as matchable
> * if TextProcessingConfig#MatchUpperCaseTokens is enabled
>     * all tokens with hasMatchablePOS == false are marked as matchable
> __3. Unknown POS tag Rules__
> These rules apply only to Tokens that do have AlphaNumeric characters and 
> where both hasLinkablePos == null and hasMatchablePos == null
> * if the processed language uses a unicase script or 
> TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is disabled
>     * all tokens equal to or longer than 
> TextProcessingConfig#minSearchTokenLength are marked as linkable
> * else - bicameral script and 
> TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled
>     * if UpperCase token and not sentence or sub-sentence start
>         * tokens equal to or longer than 
> TextProcessingConfig#minSearchTokenLength are marked as linkable
>         * tokens shorter than TextProcessingConfig#minSearchTokenLength are 
> marked as matchable
>     * else - lower case token or sentence or sub-sentence start
>         * tokens equal to or longer than 
> TextProcessingConfig#minSearchTokenLength are marked as matchable
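
The "Unknown POS tag Rules" quoted above can be read as a small decision function. The following is only a sketch of that reading, not the actual EntityLinkingEngine code: the class, enum and parameter names are hypothetical, "alphanumeric" is assumed to mean Character.isLetterOrDigit, and tokens for which the description states no outcome are assumed to end up as OTHER.

```java
/** Sketch of the "Unknown POS tag Rules"; all names are hypothetical, not Stanbol API. */
final class UnknownPosRules {

    /** LINKABLE implies matchable (basic rule: all linkable tokens are matchable). */
    enum TokenType { LINKABLE, MATCHABLE, OTHER }

    /** Classifies a token for which hasLinkablePos == null and hasMatchablePos == null. */
    static TokenType classify(String token,
                              boolean sentenceOrSubSentenceStart,
                              boolean unicaseScript,
                              boolean linkOnlyUpperCaseTokens,
                              int minSearchTokenLength) {
        // Basic rule: tokens without an alphanumeric character are neither
        // linkable nor matchable (assumption: alphanumeric == letter or digit).
        if (token.codePoints().noneMatch(Character::isLetterOrDigit)) {
            return TokenType.OTHER;
        }
        boolean searchableLength =
                token.codePointCount(0, token.length()) >= minSearchTokenLength;

        // Unicase script, or the upper case option is disabled:
        // every token of searchable length is linkable.
        if (unicaseScript || !linkOnlyUpperCaseTokens) {
            // Assumption: shorter tokens become OTHER; the description is silent here.
            return searchableLength ? TokenType.LINKABLE : TokenType.OTHER;
        }

        // Bicameral script and the upper case option is enabled.
        boolean upperCase = Character.isUpperCase(token.codePointAt(0));
        if (upperCase && !sentenceOrSubSentenceStart) {
            return searchableLength ? TokenType.LINKABLE : TokenType.MATCHABLE;
        }
        // Lower case token, or sentence / sub-sentence start.
        // Assumption: shorter tokens become OTHER; the description is silent here.
        return searchableLength ? TokenType.MATCHABLE : TokenType.OTHER;
    }

    public static void main(String[] args) {
        String[] tokens = {"Yesterday", "the", "University", "of", "Salzburg", "EU", "."};
        for (int i = 0; i < tokens.length; i++) {
            // Only the first token is treated as a sentence start in this demo.
            System.out.println(tokens[i] + " -> " + classify(tokens[i], i == 0, false, true, 3));
        }
    }
}
```

With the option enabled and a minimum length of 3, "University" and "Salzburg" are the only tokens in this demo that would trigger a vocabulary lookup. "Yesterday" is upper case only because it starts the sentence, so it stays matchable.
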
> Languages without NLP support
> -----
> For languages without NLP processing support - meaning that no POS tagging is 
> available - the following configurations are important
> * linkOnlyUpperCaseTokensWithMissingPosTag: This indicates that only upper 
> case Tokens should be considered for linking. Note that this option is 
> ignored for languages with a unicase script - scripts that do not use upper 
> case characters.
> * minSearchTokenLength: This indicates that only words with at least the 
> configured number of characters should be considered for linking
> By default the 'linkOnlyUpperCaseTokensWithMissingPosTag' has the same value 
> as the 'properNounsState' configuration. This means that if the "link only 
> proper nouns" option is enabled only upper case tokens will be linked for 
> languages without POS support. The default for the minSearchTokenLength is 3 
> letters.
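
The defaulting described above might be sketched as follows. This is an illustration under assumptions rather than the real TextProcessingConfig: the class and method names are hypothetical, and a null argument stands for "not explicitly configured".

```java
/** Sketch of the described defaults; hypothetical names, not the Stanbol configuration API. */
final class LinkingDefaultsSketch {

    /** Default minimum token length stated in the issue description. */
    static final int DEFAULT_MIN_SEARCH_TOKEN_LENGTH = 3;

    /**
     * Falls back to the "link only proper nouns" state when
     * linkOnlyUpperCaseTokensWithMissingPosTag is not set explicitly (null).
     */
    static boolean resolveLinkOnlyUpperCase(Boolean explicitValue, boolean linkOnlyProperNouns) {
        return explicitValue != null ? explicitValue : linkOnlyProperNouns;
    }

    /** Falls back to the default of 3 when minSearchTokenLength is not set explicitly (null). */
    static int resolveMinSearchTokenLength(Integer explicitValue) {
        return explicitValue != null ? explicitValue : DEFAULT_MIN_SEARCH_TOKEN_LENGTH;
    }

    public static void main(String[] args) {
        // Nothing configured explicitly, "link only proper nouns" enabled.
        System.out.println(resolveLinkOnlyUpperCase(null, true));
        System.out.println(resolveMinSearchTokenLength(null));
    }
}
```

An explicitly configured value always wins in this sketch; the fallback only applies when the option is left unset.
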
> [1] 
> http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types



--
This message was sent by Atlassian JIRA
(v6.1#6144)
