[
https://issues.apache.org/jira/browse/STANBOL-740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486009#comment-13486009
]
Rupert Westenthaler edited comment on STANBOL-740 at 11/17/12 10:06 AM:
------------------------------------------------------------------------
With revision 1403242 [1] a first implementation of the KeywordLinkingEngine that
is based on the Stanbol NLP processing module (STANBOL-733) is available in the
stanbol-nlp-processing branch [2]. This comment is intended to be moved to the
documentation on the Stanbol webpage as soon as this version is re-integrated
into the trunk.
edit: brings the documentation in line with revision
http://svn.apache.org/viewvc?rev=1410404&view=rev
## Configuration
Only changes relative to the current version are listed.
### Removed Features
* Keyword Tokenizer
(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer): This
allowed the use of a special Tokenizer for matching keywords and alphanumeric
IDs. This feature is no longer available because the KeywordLinkingEngine no
longer tokenizes the parsed text and therefore has no influence on how the text
is tokenized. To preserve this feature, a new Engine specialised for this task
would need to be implemented.
### Deprecated Configuration
* __Min POS Tag Probability__
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_:
While still functional, users should use the "prob" parameter supported by the
_Processed Languages_ configuration instead. See the section on language
processing parameters for details.
### Updated Features
* __Min Matched Tokens__
_org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens_
[1..*]::int: The minimum number of matching tokens. Only "matchable" tokens are
counted. For full matches (where all tokens of the Label match tokens in the
text) this parameter is ignored.
This parameter is strongly related to _Min Label Match Score_. Typical
settings are
1. _Min Matched Tokens_=1 and _Min Label Match Score_ > 0.5 (e.g. 0.75)
2. _Min Matched Tokens_=2 and _Min Label Match Score_ <= 0.5 (e.g. 0.5)
For Labels consisting of one or two words both options give the same
result, but for longer labels (1) is more restrictive than (2). The important
thing is that both options ensure that Labels with more than one token will
not be considered if only a single token matches the text.
If used in combination with a disambiguation Engine, one might want to
suggest Entities where only a single token of a multi-token label matches.
In such cases a configuration like _Min Matched Tokens_=1 and _Min Label
Match Score_ <= 0.5 (e.g. 0.4) might be considered. In such scenarios users
will also want to considerably increase the value of _Max Suggestions_
(typically values > 10).
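The interplay of _Min Matched Tokens_ and _Min Label Match Score_ can be illustrated with a small sketch. It assumes a simplified label score (matched tokens divided by label tokens) and an inclusive threshold comparison; the real score is additionally reduced by non-exact matches and differing token order. The function name `accepts` is hypothetical and not part of the engine.

```python
def accepts(label_tokens, matched_tokens, min_matched_tokens, min_label_score):
    """Simplified acceptance check (an assumption, not the engine's code)."""
    if matched_tokens == label_tokens:
        # Full matches ignore the Min Matched Tokens parameter.
        return True
    # Simplified label score: share of the label's tokens found in the text.
    label_score = matched_tokens / label_tokens
    return matched_tokens >= min_matched_tokens and label_score >= min_label_score


# (label tokens, matched tokens) pairs used to compare both typical settings
cases = [(1, 1), (2, 1), (2, 2), (3, 2), (4, 3)]
option_1 = [accepts(n, m, 1, 0.75) for n, m in cases]
option_2 = [accepts(n, m, 2, 0.5) for n, m in cases]
print(option_1)
print(option_2)
```

Under these assumptions the two settings differ only for the three-token label with two matched tokens, which option (2) accepts and option (1) rejects. Neither accepts a two-token label with a single matched token.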
### New Features
* __Link ProperNouns only__
_(org.apache.stanbol.enhancer.engines.keywordextraction.properNounsState)_:
This boolean switch makes it easy to switch between linking all nouns
(state=false) or only proper nouns (state=true). "Noun linking" is equivalent
to the current behavior of the KeywordLinkingEngine while "ProperNoun linking"
is more similar to using NER with the NamedEntityLinking engine. For linking
against vocabularies that contain Entities typically mentioned in texts as
ProperNouns, activating this option will greatly improve performance as far
fewer words need to be looked up in the Vocabulary. When linking to a Vocabulary
that defines Entities that might be mentioned as common nouns, this option needs
to be deactivated.
* __Processed Languages__
_(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_:
This feature allows one (1) to explicitly define the languages processed by the
Engine and (2) to provide language specific configurations. Language specific
configurations will override/extend engine global configurations. See the
_Processed Language Configuration_ section for details.
* __Max Search Token Distance__
_org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokenDistance_
[1..*]::int: The maximum number of Tokens searched around linked Tokens to be
included in searches within the linked vocabulary (default value is '3'). As an
example, in the text section "at the University of Munich a new procedure to"
only "Munich" would be looked up in the Vocabulary if "ProperNoun" linking is
activated. However, when searching for possible matches in the Vocabulary it
makes sense to use additional Tokens to reduce (and better rank) possible
matches for "Munich". Because of that, "matchable" words surrounding
looked-up tokens are considered for inclusion in searches in the Vocabulary.
This parameter configures the maximum distance of words that are used
for such searches. Note that this parameter will not cause words outside of a
Chunk to be used for searches (unless the "Ignore Chunks" option is activated).
* __Max Search Tokens__
_org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokens_
[1..*]::int: The maximum number of Tokens used for searches in the Controlled
Vocabulary (default value is '2'). This sets the maximum number of Tokens used
in OR queries to the linked Vocabulary.
* _org.apache.stanbol.enhancer.engines.keywordextraction.lemmaMatching_
[true/false]: This allows the Lemma to be used instead of the original token in
the text for searching and matching Labels of Entities. This feature is
currently disabled by default.
* __Min Label Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minLabelMatchFactor_
[0..1]::double: The "Label Score" [0..1] represents how much of the Label of an
Entity matches the Text. It compares the number of Tokens of the Label
with the number of Tokens matched to the Text. Non-exact matches for Tokens, or
Tokens within the label that appear in a different order than in the text, also
reduce this score. Entities are only considered if at least one of their
labels scores higher than the minimum for all three of _Min Label Match Score_,
_Min Text Match Score_ and _Min Match Score_.
* __Min Text Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_
[0..1]::double: The "Text Score" [0..1] represents how well the Label of an
Entity matches the selected Span in the Text. It compares the number of
matched Tokens from the label with the number of Tokens enclosed by the
Span in the Text an Entity is suggested for. Non-exact matches for Tokens, or
Tokens within the label that appear in a different order than in the text, also
reduce this score. Entities are only considered if at least one of their
labels scores higher than the minimum for all three of _Min Label Match Score_,
_Min Text Match Score_ and _Min Match Score_.
* __Min Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_
[0..1]::double: Defined as the product of the "Text Score" and the "Label
Score" - meaning that this value represents both how well the label matches the
text and how much of the label is matched with the text. Entities are only
considered if at least one of their labels scores higher than the minimum for
all three of _Min Label Match Score_, _Min Text Match Score_ and _Min Match
Score_.
* _org.apache.stanbol.enhancer.engines.keywordextraction.dereferenceFields_:
Allows additional fields to be defined that are included for dereferenced
Entities. Only applied if "Dereference Entities" is enabled.
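The relation between the three match scores described in the list above can be sketched as follows. This is a simplified illustration, not the engine's implementation: it ignores the penalties for non-exact matches and token order, treats the thresholds as inclusive, and the function names and threshold values are hypothetical.

```python
def match_scores(matched, label_tokens, span_tokens):
    """Return (label score, text score, match score) for one label."""
    # Label Score: how much of the label is matched by the text.
    label_score = matched / label_tokens
    # Text Score: how much of the selected span is covered by the label.
    text_score = matched / span_tokens
    # Match Score: product of both scores.
    return label_score, text_score, label_score * text_score


def is_considered(scores, min_label, min_text, min_match):
    """A label qualifies only if it passes all three minimums."""
    label_score, text_score, match_score = scores
    return (label_score >= min_label
            and text_score >= min_text
            and match_score >= min_match)


# Two of three label tokens matched against a two-token span in the text.
scores = match_scores(matched=2, label_tokens=3, span_tokens=2)
lenient = is_considered(scores, min_label=0.5, min_text=0.5, min_match=0.5)
strict = is_considered(scores, min_label=0.75, min_text=0.5, min_match=0.5)
print(scores, lenient, strict)
```

In this example the label passes the lenient thresholds but fails the strict ones, because only two thirds of its tokens are matched.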
## Processed Language Configuration
With the key
_'org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages'_
the processed language(s) and language specific configurations can be defined.
This configuration serves two purposes: first, to specify the languages
processed by the engine and, second, to pass language specific parameters that
define how those languages should be processed.
__1. Language Configuration:__
For the configuration of the processed languages the following syntax is used:
    de
    en
This would configure the Engine to only process German and English texts. It is
also possible to explicitly exclude languages
    !fr
    !it
    *
This specifies that all languages other than French and Italian are processed.
Values MUST BE passed as Array or Vector. This is done by using the
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following
example shows the two above examples combined into a single configuration.
    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
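How such a list of includes, excludes and the wildcard could be evaluated is sketched below. This is an illustration under stated assumptions, not the engine's code: the function name `is_processed` is hypothetical, only plain language entries are handled, and an explicit exclusion is assumed to win over the wildcard.

```python
def is_processed(language, config):
    """Decide whether a language is processed for a given configuration."""
    # Entries starting with '!' explicitly exclude a language.
    excluded = {entry[1:] for entry in config if entry.startswith("!")}
    # All other entries except the wildcard explicitly include a language.
    included = {entry for entry in config
                if not entry.startswith("!") and entry != "*"}
    if language in excluded:
        return False
    # The wildcard '*' enables every language that is not excluded.
    return language in included or "*" in config


combined = ["!fr", "!it", "de", "en", "*"]
only_de_en = ["de", "en"]
print(is_processed("de", combined))    # included explicitly
print(is_processed("fr", combined))    # excluded explicitly
print(is_processed("es", combined))    # covered by the wildcard
print(is_processed("es", only_de_en))  # not listed, no wildcard
```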
__2. Language specific Parameters__
In addition to specifying the processed languages this configuration can also
be used to pass language specific parameters. The syntax for parameters is as
follows
    {language};{param-name}={param-value};{param-name}={param-value}
    *;{param-name}={param-value};{param-name}={param-value}
    ;{param-name}={param-value};{param-name}={param-value}
The first line sets the parameters for {language}. The second and third lines
show that either the wildcard language '*' or the empty language '' can be used
to configure parameters that are used as defaults for all languages.
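The line syntax above can be split into a language and its parameters with a few lines of code. The following is a minimal sketch, not the parser used by the engine; the function name `parse_language_line` is hypothetical, and parameters given without a value are assumed to map to an empty string.

```python
def parse_language_line(line):
    """Split '{language};{name}={value};...' into (language, params)."""
    # The first ';'-separated element is the language; it may be '*' or ''.
    language, *raw_params = line.split(";")
    params = {}
    for raw in raw_params:
        if not raw:
            continue  # tolerate a trailing ';'
        # partition() keeps a parameter without '=' as a name with value ''.
        name, _, value = raw.partition("=")
        params[name] = value
    return language, params


print(parse_language_line("de;uc=MATCH"))
print(parse_language_line("*;uc=LINK;pprob=0.75"))
print(parse_language_line(";lc=Noun"))
print(parse_language_line("en"))
```

A caller would treat the results for the languages `'*'` and `''` as defaults and merge the parameters of a specific language on top of them.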
### Language Processing Parameters
The following param-names are supported by the KeywordLinkingEngine
__Phrase level Parameters:__
* __pc__ {name}::LexicalCategory - The _Phrase Categories_ processed by the
Engine. Valid values include the names of members of the LexicalCategory
enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...)
* __ptag__ {tag}::String - the _Phrase Tag_ processed by the Engine. This
allows the String tags used by the Chunker of a language to be configured. This
should only be used if the Chunk types of the Chunker are not mapped to
members of the LexicalCategory enumeration.
* __pprob__ [0..1)::double - the _Min Phrase Tag Probability_ for Chunks to be
accepted as processable ('value/2' is sufficient for rejecting).
* __lmmtip__ [''/true/false]::boolean - the _Link Multiple Matchable Tokens in
Phrases_ parameter. As the name suggests, it enables/disables the linking of
multiple matchable tokens within the same Chunk. This is especially important
if _Proper Noun Linking_ is active, as it allows 'named entities' that consist
of two common nouns to be detected. NOTE that 'lmmtip' is short for
'lmmtip=true'
__Token level Parameters:__
* __lc__ {name}::LexicalCategory - The linked _Token Categories_. Valid values
include the names of members of the LexicalCategory enumeration (e.g. "Noun",
"Verb", "Adjective", "Adposition", …). Typical configurations include "lc=Noun"
or an empty list ("lc" or "lc=") to deactivate all categories and provide a more
fine-grained Pos or Tag level configuration.
* __pos__ {name}::Pos - The linked _Pos Types_. Valid values include the
names of members of the Pos enumeration (e.g. "ProperNoun", "CommonNoun",
"Infinitive", "Gerund", "PresentParticiple" and ~150 others). This parameter
can be used to provide a very fine-grained configuration. It is e.g. used by
the _Link ProperNouns only_ setting to define that only "pos=ProperNoun" are
linked.
* __tag__ {tag}::String - The linked _Pos Tags_. This parameter allows POS tags
as used by the POS tagger to be configured. This is useful if those Tags are
not mapped to LexicalCategories or Pos types.
* __prob__ [0..1)::double - the _Min PosTag Probability_. This parameter
replaces the formerly used _Min POS tag probability_
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_
property. It defines the minimum confidence required for a POS annotation to be
accepted for linkable and matchable tokens ('value/2' is sufficient for
rejecting non-linked/matched tokens).
* __uc__ {NONE/MATCH/LINK}::string - the _Upper Case Token Mode_ configures
how upper case words are treated. There are three possible modes: (1)
NONE: defines that they are not treated specially; (2) MATCH: defines that they
are considered as matchable tokens (independent of the POS tag or the token
length); (3) LINK: defines that they are always linked with the vocabulary.
The default is "LINK" - as upper case words often represent named entities -
with the exception of German ('de') where the mode is set to MATCH - as all
Nouns in German are upper case.
NOTE that tokens are linked if any of "lc", "pos" or "tag" match the
configuration. This means that adding "lc=Noun" will render "pos=ProperNoun"
useless as the Pos type ProperNoun is already included in the LexicalCategory
Noun.
__Examples:__
The default configuration for the KeywordLinkingEngine uses the following
setting
    *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
    de;uc=MATCH
    es;lc=Noun
    nl;lc=Noun
The first line enables _Link Multiple Matchable Tokens in Phrases_ and linking
of upper case tokens for all languages. In addition it sets the minimum
probabilities for Pos and Phrase annotations to 0.75 (which would also be the
default). The following three lines provide additional language specific
defaults. For German the upper case mode is reset to MATCH as in German all
Nouns use upper case. For Spanish and Dutch, linking for the LexicalCategory
Noun is enabled. This is because the OpenNLP POS tagger for those languages does
not support ProperNouns and therefore the Engine would not link any tokens if
_Link ProperNouns only_ is enabled. The same configuration in the OSGI
'.config' file syntax would look as follows
    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
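Converting the line-based notation into the '.config' syntax mainly means quoting each entry and escaping the '=' characters. The helper below is a hypothetical sketch of that step (the name `to_osgi_config` is an assumption); it does not handle entries that themselves contain quotes or backslashes.

```python
KEY = "org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages"


def to_osgi_config(key, entries):
    """Render configuration entries as a '.config' style array property."""
    # Escape '=' inside values and wrap every entry in double quotes.
    quoted = ['"' + entry.replace("=", "\\=") + '"' for entry in entries]
    return key + "=[" + ",".join(quoted) + "]"


line = to_osgi_config(KEY, ["de;uc=MATCH", "es;lc=Noun", "nl;lc=Noun"])
print(line)
```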
The second example shows how to define default settings without using the
wildcard '*', which would enable processing of all languages. The following
example shows a configuration that only enables English and ignores text in all
other languages.
    ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
    en
    de;uc=MATCH
## Extension Points
This section describes Interfaces that are used as Extension Points by the
KeywordLinkingEngine
### EntitySearcher
The EntitySearcher interface is used by the KeywordLinkingEngine to search for
Entities in the linked Vocabulary. Currently the Stanbol Entityhub based
implementations are instantiated based on the value of the
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_.
Users who want to use a different implementation of this interface for linking
need to extend the KeywordLinkingEngine and override the
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object>
configuration) and #deactivateEntitySearcher() methods. Those methods are called
during activation/deactivation of the KeywordLinkingEngine and are expected to
set/unset the #entitySearcher field.
### LabelTokenizer
The LabelTokenizer interface is used to tokenize labels of Entities from the
linked Vocabulary. As the matching process of the KeywordLinkingEngine is based
on Tokens (words), multi-word labels (e.g. University of Munich) need to be
tokenized before they can be matched against the current context in the Text.
LabelTokenizers are OSGI services. Their configuration can optionally define the
_'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property.
Values are considered to be language configurations. Configurations can
explicitly include/exclude languages. A wildcard is also supported (e.g.
"en,de" would include English and German; "!it,!fr,*" would specify all
languages except Italian and French). If no configuration is provided then "*"
(all languages) is assumed.
The KeywordLinkingEngine will - by default - always use the LabelTokenizer with
the highest "service.ranking" for a given language to tokenize labels. By
default it comes with an OpenNLP based Tokenizer implementation that registers
itself for all languages with a "service.ranking" of "-1000".
Users who want to use a different Tokenizer need to register an implementation
for the given language(s) with a higher "service.ranking". Users who want to
provide their own LabelTokenizer and ignore the values provided by OSGI need to
extend the KeywordLinkingEngine, set the #labelTokenizer field themselves AND
override the #bindLabelTokenizer(LabelTokenizerManager ltm) and
#unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do
NOT change the #labelTokenizer field.
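The selection rule described above (highest "service.ranking" among the LabelTokenizers registered for a language) can be modelled as follows. This is a plain Python illustration, not OSGI or Stanbol code; the class `RegisteredTokenizer`, the helper names and the tokenizer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RegisteredTokenizer:
    name: str
    languages: str = "*"  # e.g. "en,de" or "!it,!fr,*"; "*" if not configured
    ranking: int = 0      # stands in for the OSGI "service.ranking"


def supports(languages, language):
    """Evaluate a language configuration such as '!it,!fr,*'."""
    entries = [e.strip() for e in languages.split(",") if e.strip()]
    if "!" + language in entries:
        return False
    return language in entries or "*" in entries


def select_tokenizer(tokenizers, language):
    """Pick the supporting tokenizer with the highest ranking, if any."""
    candidates = [t for t in tokenizers if supports(t.languages, language)]
    return max(candidates, key=lambda t: t.ranking, default=None)


registered = [
    RegisteredTokenizer("opennlp-default", "*", -1000),
    RegisteredTokenizer("custom-german", "de", 0),
]
print(select_tokenizer(registered, "de").name)
print(select_tokenizer(registered, "en").name)
```

In this model the custom tokenizer is chosen for German because its ranking is higher than "-1000", while all other languages still fall back to the default registration.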
[1] http://svn.apache.org/viewvc?rev=1403242&view=rev
[2] https://svn.apache.org/repos/asf/stanbol/branches/stanbol-nlp-processing
was (Author: rwesten):
With revision 1403242 a first implementation of the KeywordLinkingEngine
that is based on the Stanbol NLP processing module (STANBOL-733) is available in
the stanbol-nlp-processing branch [2]. This comment is intended to be moved to
the documentation on the Stanbol webpage as soon as this version is
re-integrated into the trunk.
## Configuration
Only changes relative to the current version are listed.
### Removed Features
* Keyword Tokenizer
(org.apache.stanbol.enhancer.engines.keywordextraction.keywordTokenizer): This
allowed the use of a special Tokenizer for matching keywords and alphanumeric
IDs. This feature is no longer available because the KeywordLinkingEngine no
longer tokenizes the parsed text and therefore has no influence on how the text
is tokenized. To preserve this feature, a new Engine specialised for this task
would need to be implemented.
### New Features
* __Link ProperNouns only__
_(org.apache.stanbol.enhancer.engines.keywordextraction.properNounsState)_:
This boolean switch makes it easy to switch between linking all nouns
(state=false) or only proper nouns (state=true). "Noun linking" is equivalent
to the current behavior of the KeywordLinkingEngine while "ProperNoun linking"
is more similar to using NER with the NamedEntityLinking engine. For linking
against vocabularies that contain Entities typically mentioned in texts as
ProperNouns, activating this option will greatly improve performance as far
fewer words need to be looked up in the Vocabulary. When linking to a Vocabulary
that defines Entities that might be mentioned as common nouns, this option needs
to be deactivated.
* __Processed Languages__
_(org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages)_:
This feature allows one (1) to explicitly define the languages processed by the
Engine and (2) to provide language specific configurations. Language specific
configurations will override/extend engine global configurations. See the next
section for details.
* _org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokenDistance_:
The maximum number of Tokens searched around linked Tokens to be included in
searches within the linked vocabulary (default value is '3'). As an example, in
the text section "at the University of Munich a new procedure to" only "Munich"
would be looked up in the Vocabulary if "ProperNoun" linking is
activated. However, when searching for possible matches in the Vocabulary it
makes sense to use additional Tokens to reduce (and better rank) possible
matches for "Munich". Because of that, "matchable" words surrounding looked-up
tokens are considered for inclusion in searches in the Vocabulary. This
parameter configures the maximum distance of words that are used for such
searches. Note that this parameter will not cause words outside of a Chunk to
be used for searches (unless the "Ignore Chunks" option is activated).
* _org.apache.stanbol.enhancer.engines.keywordextraction.maxSearchTokens_: The
maximum number of Tokens used for searches in the Controlled Vocabulary
(default value is '2'). This sets the maximum number of Tokens used in OR
queries to the linked Vocabulary.
* _org.apache.stanbol.enhancer.engines.keywordextraction.dereferenceFields_:
Allows additional fields to be defined that are included for dereferenced
Entities. Only applied if "Dereference Entities" is enabled.
### Processed Language Configuration
With the key
_'org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages'_
the processed language(s) and language specific configurations can be defined.
For the configuration of the processed languages the following syntax is used:
    de
    en
This would configure the Engine to only process German and English texts. It is
also possible to explicitly exclude languages
    !fr
    !it
    *
This specifies that all languages other than French and Italian are processed.
Values MUST BE passed as Array or Vector. This is done by using the
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following
example shows the two above examples combined into a single configuration.
    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
In addition to specifying the processed languages this configuration can also
be used to pass language specific parameters. The syntax for parameters is as
follows
    {language};{param-name}={param-value};{param-name}={param-value}
The following param-names are supported by the KeywordLinkingEngine
* __lc__: This allows the LexicalCategories of words that shall be looked
up in the Vocabulary to be passed. Valid values include the names of members of
the LexicalCategory enumeration (e.g. "Noun", "Verb", "Adjective",
"Adposition", ...)
* __pos__: This allows the Pos types of words that shall be looked up in
the Vocabulary to be passed. Valid values include the names of members of the
Pos enumeration (e.g. "ProperNoun", "CommonNoun", "Infinitive", "Gerund",
"PresentParticiple" and ~150 others).
* __tag__: This allows the string tags used by the POS tagger for a
language to be passed. Words that use those tags will be looked up in the
vocabulary.
* __prob__: Allows a language specific setting of the _Min POS tag probability_
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_.
This value [0..1] is used to decide if a POS annotation is confident enough to
use it for linking or rejecting ('value/2' is sufficient for rejecting).
Note that a word is linked if any of "lc", "pos" or "tag" matches. So setting
"pos=ProperNoun" will not have any effect if "lc=Noun" is already defined.
The following shows a "Processed Language Configuration" using all of the
above mentioned features
    !fr
    !it
    nl;lc=Noun
    *;pos=ProperNoun
This would process all languages other than French and Italian; link all Nouns
for Dutch texts and only ProperNouns for all others.
Users who want to define default parameters without using the "*" wildcard
language can use an empty language for passing the parameters. Here is an
example
    nl;lc=Noun
    da
    en
    es
    pt
    sv
    de
    ;pos=ProperNoun
This explicitly includes the seven languages for which OpenNLP POS models are
included in the Stanbol Full Launcher. In addition it sets "Noun" linking for
Dutch - as the POS tagset for this language does not distinguish between
ProperNouns and CommonNouns. For the other six languages only "ProperNouns" are
linked.
Users who directly pass configurations as OSGI ".config" files need to properly
escape configured parameters. The following example shows the above
configuration in the syntax used by ".config" files
    org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["nl;lc\=Noun","da","en","es","pt","sv","de",";pos\=ProperNoun"]
> Adopt the KeywordLinkingEngine to use the AnalyzedText content part
> -------------------------------------------------------------------
>
> Key: STANBOL-740
> URL: https://issues.apache.org/jira/browse/STANBOL-740
> Project: Stanbol
> Issue Type: Sub-task
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> The KeywordLinkingEngine currently does both NLP processing AND linking
> against the target vocabulary. Up to now this was the only possibility as
> separating those two things was not feasible with the limitations of the RDF
> metadata.
> With the introduction of the AnalyzedText content part the NLP processing
> part no longer needs to be part of the KeywordLinkingEngine.
> This issue covers
> * removal of the NLP related functionality from the KeywordLinkingEngine
> * reimplementation of the linking part on top of the API provided by the
> AnalyzedText contentpart
> * add support for new features of the NLP chain
> * use lemmas - if available - for entity lookup
> * use POS tagset mappings to the OLIA ontology to decide what tokens to
> lookup
> After this change the KeywordLinkingEngine will also be able to work in
> combination with any NLP framework that is integrated with the Stanbol NLP
> components (writes its data to the AnalyzedText content part).