[ 
https://issues.apache.org/jira/browse/STANBOL-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rupert Westenthaler updated STANBOL-1049:
-----------------------------------------

    Description: 
This issue will allow the EntityLinkingEngine to use upper case token 
information for linking of languages without NLP support. 

If TextProcessingConfig#LinkUpperCaseTokens is enabled, only upper case tokens 
that are at least as long as the configured min search token length will be 
linked with the controlled vocabulary. Lower case tokens that are at least as 
long as the min search token length will only be used for matching.

Deactivating TextProcessingConfig#LinkUpperCaseTokens will preserve the current 
behavior, where all tokens that are at least as long as the configured min 
search token length will be linked.

NOTE: this requires explicitly configuring

    {lang};uc=MATCH

for languages that do not use upper case characters (e.g. Arabic).

---

Definitions:
-------

The EntityLinking Engine distinguishes three Token Types [1]:

* Linkable Token: A word that triggers a lookup in the Controlled Vocabulary
* Matchable Token: A word that is used to search and match Entities, but does 
not trigger a lookup
* Other Tokens: Not used for search and matching. Might be used for fine-tuning 
confidence values.

Language level information includes

* isUnicaseScript [true, false]: If the processed language uses a unicase 
script, that is, a script without upper case letters

Token level information includes

* hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
* hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
* isUpperCase [true,false]: If the first letter is upper case
* hasAlphaNumeric [true,false]: If the word has an alphanumeric char
* hasSearchableLength [true,false]: If the word is at least as long as the 
configured "Min Search Token Length"
* isSubSentenceStart [true, false]: If the POS tag of a Token is Pos#Quote.


Algorithm:
------

This describes the algorithm used to classify Tokens as linkable, matchable, or 
other based on the above properties. Rules are applied in the given order. A 
summary of the result for Tokens with no POS tags is given in the next section.

__1. Basic rules:__

* all Tokens without an alphanumeric character are neither linkable nor 
matchable
* all tokens with hasLinkablePos are linkable
* all linkable tokens and tokens with hasMatchablePos are matchable

__2. Uppercase Processing Rules__

These rules are applied to upper case tokens that are not at a sentence or 
sub-sentence start

* if TextProcessingConfig#LinkUpperCaseTokens is enabled
    * all tokens with hasMatchablePos == true are also marked as linkable
    * all tokens with hasMatchablePos == false are marked as matchable
* if TextProcessingConfig#MatchUpperCaseTokens is enabled
    * all tokens with hasMatchablePos == false are marked as matchable

__3. Unknown POS tag Rules__

These rules only apply to Tokens that have alphanumeric characters and where 
both hasLinkablePos == null and hasMatchablePos == null

* if the processed language uses a unicase script or 
TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is disabled
    * all tokens at least as long as 
TextProcessingConfig#minSearchTokenLength are marked as linkable
* else - bicameral script and 
TextProcessingConfig#linkOnlyUpperCaseTokensWithMissingPosTag is enabled
    * if upper case token and not at a sentence or sub-sentence start
        * tokens at least as long as TextProcessingConfig#minSearchTokenLength 
are marked as linkable
        * tokens shorter than TextProcessingConfig#minSearchTokenLength are 
marked as matchable
    * else - lower case token, or sentence or sub-sentence start
        * tokens at least as long as TextProcessingConfig#minSearchTokenLength 
are marked as matchable
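
The three rule groups above can be read as a single classification function. 
The following is only a sketch of that reading, not the engine's actual code: 
the class TokenClassifierSketch, its Token and Config holders, and all field 
names are hypothetical and do not mirror Stanbol's API. It also assumes that 
the "== true" / "== false" checks in rule 2 leave tokens without POS 
information (null) to rule 3.

```java
/** Hypothetical sketch; class and field names are illustrative, not Stanbol's API. */
final class TokenClassifierSketch {

    enum TokenType { LINKABLE, MATCHABLE, OTHER }

    /** Per-token facts; the Boolean POS fields use null for "no POS tag available". */
    static final class Token {
        final Boolean hasLinkablePos;
        final Boolean hasMatchablePos;
        final boolean upperCase;
        final boolean hasAlphaNumeric;
        final int length;
        final boolean sentenceOrSubSentenceStart;

        Token(Boolean hasLinkablePos, Boolean hasMatchablePos, boolean upperCase,
              boolean hasAlphaNumeric, int length, boolean sentenceOrSubSentenceStart) {
            this.hasLinkablePos = hasLinkablePos;
            this.hasMatchablePos = hasMatchablePos;
            this.upperCase = upperCase;
            this.hasAlphaNumeric = hasAlphaNumeric;
            this.length = length;
            this.sentenceOrSubSentenceStart = sentenceOrSubSentenceStart;
        }
    }

    /** Options named in the description; the boolean defaults here are assumptions. */
    static final class Config {
        boolean linkUpperCaseTokens = false;
        boolean matchUpperCaseTokens = false;
        boolean linkOnlyUpperCaseTokensWithMissingPosTag = false;
        int minSearchTokenLength = 3; // default stated in the description
    }

    static TokenType classify(Token t, Config c, boolean unicaseScript) {
        // 1. Basic rules
        if (!t.hasAlphaNumeric) {
            return TokenType.OTHER;
        }
        boolean linkable = Boolean.TRUE.equals(t.hasLinkablePos);
        boolean matchable = linkable || Boolean.TRUE.equals(t.hasMatchablePos);

        boolean upperNotAtStart = t.upperCase && !t.sentenceOrSubSentenceStart;

        // 2. Upper case processing rules (only for tokens with POS information)
        if (upperNotAtStart && t.hasMatchablePos != null) {
            if (c.linkUpperCaseTokens && t.hasMatchablePos) {
                linkable = true;
            }
            if ((c.linkUpperCaseTokens || c.matchUpperCaseTokens) && !t.hasMatchablePos) {
                matchable = true;
            }
        }

        // 3. Unknown POS tag rules
        if (t.hasLinkablePos == null && t.hasMatchablePos == null) {
            boolean searchable = t.length >= c.minSearchTokenLength;
            if (unicaseScript || !c.linkOnlyUpperCaseTokensWithMissingPosTag) {
                if (searchable) {
                    linkable = true;
                }
            } else if (upperNotAtStart) {
                if (searchable) {
                    linkable = true;
                } else {
                    matchable = true;
                }
            } else if (searchable) {
                matchable = true;
            }
        }

        if (linkable) {
            return TokenType.LINKABLE;
        }
        return matchable ? TokenType.MATCHABLE : TokenType.OTHER;
    }
}
```

Under these assumptions, with linkOnlyUpperCaseTokensWithMissingPosTag enabled 
and no POS tags, a five-character upper case token in mid-sentence is 
classified as LINKABLE, while a four-character lower case token is only 
MATCHABLE.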


Languages without NLP support
-----

For languages without NLP processing support - meaning that no POS tagging is 
available - the following configurations are important

* linkOnlyUpperCaseTokensWithMissingPosTag: This indicates that only upper case 
Tokens should be considered for linking. Note that this option is ignored for 
languages with a unicase script - scripts that do not use upper case characters.
* minSearchTokenLength: This indicates that only words with at least the 
configured number of characters should be considered for linking

By default the 'linkOnlyUpperCaseTokensWithMissingPosTag' has the same value as 
the 'properNounsState' configuration. This means that if the "link only proper 
nouns" option is enabled, only upper case tokens will be linked for languages 
without POS support. The default for the minSearchTokenLength is 3 letters.
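
As an illustration of how these two options and their defaults interact, here 
is a minimal sketch. The class MissingPosDefaultsSketch and both method names 
are hypothetical, a null explicit value stands in for "not configured", and 
sentence and sub-sentence starts are ignored for brevity.

```java
/** Hypothetical sketch; names are illustrative and not part of Stanbol. */
final class MissingPosDefaultsSketch {

    /** Default stated in the description. */
    static final int DEFAULT_MIN_SEARCH_TOKEN_LENGTH = 3;

    /**
     * Returns the explicitly configured value if there is one (non-null),
     * otherwise falls back to the properNounsState setting.
     */
    static boolean resolveLinkOnlyUpperCase(Boolean explicitValue, boolean properNounsState) {
        return explicitValue != null ? explicitValue : properNounsState;
    }

    /** Whether a word without POS information is considered for linking. */
    static boolean isLinkingCandidate(String word, boolean unicaseScript,
                                      boolean linkOnlyUpperCase, int minSearchTokenLength) {
        // Count code points so characters outside the BMP count as one letter.
        int length = word.codePointCount(0, word.length());
        if (length == 0 || length < minSearchTokenLength) {
            return false;
        }
        // The upper case restriction is ignored for unicase scripts.
        if (unicaseScript || !linkOnlyUpperCase) {
            return true;
        }
        return Character.isUpperCase(word.codePointAt(0));
    }
}
```

For example, with properNounsState enabled and nothing configured explicitly, 
resolveLinkOnlyUpperCase(null, true) yields true, so "Vienna" is a linking 
candidate while "vienna" and "EU" are not.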


[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types

  was:
This issue will allow the EntityLinkingEngine to use upper case token 
information for linking of languages without NLP support. 

If TextProcessingConfig#LinkUpperCaseTokens is enabled, only upper case tokens 
that are at least as long as the configured min search token length will be 
linked with the controlled vocabulary. Lower case tokens that are at least as 
long as the min search token length will only be used for matching.

Deactivating TextProcessingConfig#LinkUpperCaseTokens will preserve the current 
behavior, where all tokens that are at least as long as the configured min 
search token length will be linked.

NOTE: this requires explicitly configuring

    {lang};uc=MATCH

for languages that do not use upper case characters (e.g. Arabic).

---

Definitions:
-------

The EntityLinking Engine distinguishes three Token Types [1]:

* Linkable Token: A word that triggers a lookup in the Controlled Vocabulary
* Matchable Token: A word that is used to search and match Entities, but does 
not trigger a lookup
* Other Tokens: Not used for search and matching. Might be used for fine-tuning 
confidence values.

Language level information includes

* isUnicaseScript [true, false]: If the processed language uses a unicase 
script, that is, a script without upper case letters

Token level information includes

* hasLinkablePos [true,null,false]: If a POS tag matches the linkable POS
* hasMatchablePos [true,null,false]: If a POS tag matches the processable POS
* isUpperCase [true,false]: If the first letter is upper case
* hasAlphaNumeric [true,false]: If the word has an alphanumeric char
* hasSearchableLength [true,false]: If the word is at least as long as the 
configured "Min Search Token Length"
* isSubSentenceStart [true, false]: If the POS tag of a Token is Pos#Quote.


Algorithm:
------

This describes the algorithm used to classify Tokens as linkable, matchable, or 
other based on the above properties. Rules are applied in the given order. A 
summary of the result for Tokens with no POS tags is given in the next section.

__(1) Basic rules:__

* all Tokens without an alphanumeric character are neither linkable nor 
matchable
* all tokens with hasLinkablePos are linkable
* all linkable tokens and tokens with matchable POS are matchable

__(2) Unknown POS tag Rules__

These rules only apply to Tokens with alphanumeric characters that are not (yet) 
marked as linkable

* if the processed language uses a unicase script or 
    * all tokens equals or longer than  

__(2) Uppercase Rules__

These rules are applied to all non-linkable tokens that are (1) upper case and 
(2) not at a sentence or sub-sentence start

* if TextProcessingConfig#LinkUpperCaseTokens is enabled
    * all matchable Tokens are also linkable
    * all other Tokens are converted to matchable
* if TextProcessingConfig#MatchUpperCaseTokens is enabled
    * all other Tokens are converted to matchable
    * all Tokens with linkablePos == null and searchableLength are converted to 
linkable

__(3) Searchable Token Rules__

These rules are only applied to non-linkable Tokens with hasLinkablePos == null 
and hasMatchablePos == null

* if TextProcessingConfig#LinkUpperCaseTokens == false
    * all Tokens with searchableLength are marked as linkable


Languages without NLP support
-----

The above algorithm ensures that for languages without NLP support (no POS 
tags) Tokens are marked as follows:

__LinkUpperCaseTokens is enabled__

* Linkable: All upper case tokens with a searchable length
* Matchable: All upper case tokens shorter than the min searchable length; all 
lower case tokens with a searchable length
* Other Tokens: All lower case tokens shorter than the min searchable length

__LinkUpperCaseTokens is disabled__

* Linkable: All tokens with a searchable length
* Other Tokens: All tokens shorter than the min searchable length



[1] 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/entitylinking#token-types

    
> Add support for Upper Case Linking for Languages without NLP support
> --------------------------------------------------------------------
>
>                 Key: STANBOL-1049
>                 URL: https://issues.apache.org/jira/browse/STANBOL-1049
>             Project: Stanbol
>          Issue Type: Improvement
>          Components: Enhancement Engines
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>

