entitylinking.mdtext

rwesten Fri, 23 Nov 2012 03:16:43 -0800

Author: rwesten
Date: Fri Nov 23 11:16:11 2012
New Revision: 1412829

URL: http://svn.apache.org/viewvc?rev=1412829&view=rev
Log:
initial documentation for STANBOL-733


Added:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1412829&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 Fri Nov 23 11:16:11 2012
@@ -0,0 +1,277 @@
+Title: EntityLinkingEngine
+
+The EntityLinkingEngine is an Engine that consumes results from NLP processing 
from the [AnalyzedText](../nlp/analyzedtext) content part and uses that 
information to link (search and match) entities from a configured vocabulary.
+
+To do so it uses the following configurations and components:
+
+* __Text Processing Configuration__: This configures how the 
EntityLinkingEngine consumes NLP processing results. Such configurations can be 
language specific.
+* __Entity Linking Configuration__: This configures various properties that 
are used for the linking process with the vocabulary.
+* __EntitySearcher__: This interface is used to search and dereference 
Entities. It needs to be implemented to use a datasource for linking with the 
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub 
(see [EntityhubLinkingEngine](entityhublinkingengine)).
+* __LabelTokenizer__: While the processed text is already tokenized, the Entity 
labels are not. To match labels with the text the EntityLinkingEngine 
therefore needs to tokenize those labels. Apache Stanbol 
provides a default implementation of this interface based on the 
[OpenNLP](http://opennlp.apache.org) tokenizer API.
+
+The EntityLinkingEngine cannot be used directly, as the four things listed 
above need to be passed to its constructor. It is instead intended to be 
configured/extended by other components. The 
[EntityhubLinkingEngine](entityhublinkingengine) is one of them: it configures the 
EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.
+
+This documentation first describes the implemented entity linking process, then 
provides information about the supported configuration parameters of the _Text 
Processing Configuration_ and the _Entity Linking Configuration_. The last part 
describes how to extend the EntityLinkingEngine by implementing/providing a 
custom _EntitySearcher_ and _LabelTokenizer_.
+
+## Linking Process:
+
+The Linking Process consists of three major steps: First, it consumes results 
of the NLP processing to determine tokens (words) that need to be linked with 
the configured vocabulary. Second, it links entities based on their labels 
with the current section of the text. Third, it writes the enhancement 
results.
+
+
+### Token Types
+
+The EntityLinkingEngine operates based on tokens (words). Those tokens are 
divided into the following categories:
+
+* __Linkable Tokens__: These are words that are linked with the Vocabulary. 
This means that the engine will issue queries in the controlled vocabulary for 
those tokens.
+* __Matchable Tokens__: Matchable tokens are used to refine queries. For the 
matching of entity labels with the text those words are treated in the same way 
as linkable words. The main difference is that matchable words alone will 
not cause the engine to query for Entities in the Controlled Vocabulary.
+* __Other Tokens__: All other tokens in the text are not used for searches in 
the configured vocabulary. However, during the matching of labels with the text 
they are considered, as they might also be present in labels of entities.
+
+"University of Salzburg" is a good example: 'University', a common noun, 
can be considered a matchable token, 'of' an other token, and 'Salzburg', as a 
proper noun, is a typical linkable token. As the engine only queries for linkable 
tokens, a single query for 'Salzburg' would be issued against the vocabulary. However, 
this query would also use the matchable token 'University' as a secondary query 
term. The token 'of' would only be considered during matching.
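+
+The classification above can be sketched in a few lines of Python. This is only an illustration: the `Token` class, the `token_type` function and the plain string POS labels are hypothetical stand-ins, not part of the Stanbol API, and the mapping from POS to token type is assumed to be "proper noun = linkable, common noun = matchable".
+
```python
from dataclasses import dataclass

# Hypothetical, simplified POS labels. The real engine uses the
# LexicalCategory and Pos enumerations of the Stanbol NLP module.
LINKABLE_POS = {"ProperNoun"}
MATCHABLE_POS = {"CommonNoun"}


@dataclass
class Token:
    text: str
    pos: str


def token_type(token):
    """Return 'linkable', 'matchable' or 'other' for a POS-tagged token."""
    if token.pos in LINKABLE_POS:
        return "linkable"
    if token.pos in MATCHABLE_POS:
        return "matchable"
    return "other"


tokens = [
    Token("University", "CommonNoun"),
    Token("of", "Adposition"),
    Token("Salzburg", "ProperNoun"),
]
types = {t.text: token_type(t) for t in tokens}
# Only linkable tokens trigger a query against the vocabulary.
queried = [t.text for t in tokens if token_type(t) == "linkable"]
print(types)
print(queried)
```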
+
+In addition to the token type the engine also determines the following parameters:
+
+* __Token Length__: The number of characters of a word. This is especially 
important for languages where no POS tagger is available.
+* __Alpha-Numeric__: Whether a Token contains an alphabetic or a numeric character. 
This is mainly used to skip processing of tokens that represent punctuation.
+* __Upper Case__: Upper case Tokens often represent named entities. Because 
of that the Engine keeps track of upper case words.
+* __Token Phrase__: Whether a Token is a member of a _processable_ Phrase. Phrases 
are groups of Tokens that can be detected by a Chunker. Typical examples are 
Noun Phrases.
+
+
+### Consumed NLP Processing Results:
+
+The EntityLinkingEngine consumes NLP processing results from the AnalyzedText 
ContentPart of the processed ContentItem. The following list describes the 
consumed information and its usage in the linking process: 
+
+1. __Language__ _(required)_: The Language of the Text is acquired from the 
Metadata of the ContentItem. It is required to search for labels in the correct 
language and also to correctly apply language specific configurations of the 
engine.
+2. __Sentences__ _(optional)_: Sentence annotations are used as segments for 
the matching process. In addition, the _Upper Case_ feature is NOT set for the 
first word of a Sentence. If no Sentence Annotations are present 
the whole text is treated as a single Sentence.
+3. __Tokens__ _(required)_: As this Engine is based on the processing of 
Tokens this information is absolutely required.
+4. __POS Annotations__ _(optional)_: Part of Speech (POS) tags are used to 
determine the _Token Type_. The NLP processing module provides two enumerations 
that define POS types: the high level _Lexical Categories_ (16 members 
including "Noun", "Verb", "Adjective", "Adposition" ...) and the Pos 
enumeration with ~150 very detailed POS definitions (e.g. 
"ProperNoun", "CommonNoun", "Infinitive", "Gerund", "PresentParticiple" ...). 
In addition the engine can also be configured to use the string tag as used by 
the POS tagger. The mapping of the _POS Annotation_ to the _Token Type_ is 
provided by the Engine configuration and can be language specific.
+5. __Phrase Annotation__ _(optional)_: Phrase Annotations of Chunks present in 
the AnalyzedText are checked against the configured processable phrase 
categories. The linking of Tokens is NOT limited to Tokens within processable 
phrases. Phrases are only used as additional context to improve the matching 
process. The _Lexical Category_ and the string tags used by the Chunker can be 
used to configure the processable Phrase categories.
+6. __Lemma__ _(optional)_: The Lemma provided by the MorphoAnalysis annotation 
can be used for linking instead of the token as used within the text.
+
+
+### Entity Linking:
+
+The linking process is based on matching the labels of entities returned as 
results of searches in the configured controlled vocabulary. In 
addition the engine can be configured to consider redirects for entities 
returned by searches.
+
+Searches are issued only for _Linkable Tokens_ and may include up to _Max 
Search Tokens_ additional _Linkable-_ or _Matchable Tokens_. If the _Linkable 
Token_ is within a _Phrase_ then only other tokens within the same phrase are 
considered. Otherwise any _Linkable-_ or _Matchable Tokens_ within the 
configured _Max Search Token Distance_ are considered for the search.
+
+Searches to the controlled vocabulary are issued using the _EntitySearcher_ 
interface and are built as follows:
+
+    {lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]
+
+where:
+
+    * {lt} ... the _Linkable Token_ for which the search is issued
+    * {at} ... additional _Linkable-_ or _Matchable Tokens_ included in the 
search
+    * {lang} ... the language of the text
+    * {dl} ... the configured _Default Matching Language_. If {dl} == {lang} 
then the OR term(s) for the {dl} are omitted
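+
+As a rough sketch of how such a query could be assembled, the following Python snippet builds the list of OR-ed terms. The function names `build_query_terms` and `render` are hypothetical and only mirror the pattern shown above; the actual query is built by the engine and passed to the _EntitySearcher_.
+
```python
def build_query_terms(linkable, additional, lang, default_lang):
    """Return the OR-ed search terms as (token, language) pairs.

    Terms for the default matching language are omitted when it is
    equal to the language of the text.
    """
    languages = [lang]
    if default_lang != lang:
        languages.append(default_lang)
    terms = []
    for token in [linkable, *additional]:
        for language in languages:
            terms.append((token, language))
    return terms


def render(terms):
    """Render the terms in the '{token}@{lang} || ...' notation."""
    return " || ".join(f"{token}@{language}" for token, language in terms)


german_text = render(build_query_terms("Salzburg", ["University"], "de", "en"))
english_text = render(build_query_terms("Salzburg", ["University"], "en", "en"))
print(german_text)
print(english_text)
```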
+
+For results of those queries the labels in the {lang} and {dl} are matched 
against the text. However {dl} labels are only considered if no match was found 
for labels in the language of the text. For matching labels with the Tokens of 
the text the engine needs to tokenize the labels. This is done by using the 
_LabelTokenizer_ interface.
+
+The matching process distinguishes between matchable and non-matchable Tokens 
as well as non-alpha-numeric Tokens, which are completely ignored. Matching 
starts at the position of the _Linkable Token_ for which the search in the 
configured vocabulary was issued. From this position Tokens in the Label are 
matched with Tokens in the text until the first matchable or 2nd non-matchable 
token is not found. In a second round the same is done in the backward 
direction. The configured _Min Token Match Factor_ determines how exactly tokens 
in the text must correspond to tokens in the label so that a match is 
considered. This is repeated for all labels of an Entity. The label match that 
covers the most tokens is then considered as the match for that Entity.
+
+There are various parameters that can be used to fine tune the matching 
process. The most important decision is whether to include suggestions 
where labels with two tokens match only a single _Matchable Token_ in the 
Text (e.g. "Barack Obama" matching "Obama" but also 1000+ "Tom {something}" 
matching "Tom"). The default configuration of the Engine excludes those, but 
depending on the use case and the linked vocabulary users might want to change 
this. See the documentation of the _Min Matched Tokens_ and _Min Label Match 
Score_ for details and examples. 
+
+
+### Writing Enhancement Results
+
+This step covers the following tasks:
+
+* processing of redirects as configured by the _Redirect Mode_
+* mapping of the Entity types to the dc:type values for fise:TextAnnotations 
as configured by the _Type Mappings_ configuration 
+* if _Dereference Entities_ is enabled then information for all configured 
_Dereferenced Fields_ needs to be obtained
+* writing of the fise:TextAnnotations, fise:EntityAnnotations and dereferenced 
entities (if enabled) to the metadata of the processed ContentItem
+
+
+## Configurations
+
+The configuration of the EntityLinkingEngine is done by passing a 
_TextProcessingConfig_ and an _EntityLinkingConfig_ to its constructor. Both 
configuration classes provide an API based configuration (via getters and setters) 
as well as an OSGI Dictionary based configuration (via a static method that 
configures a new instance from a passed configuration).
+
+The following two sections describe the "key, value" based configuration, as 
the API based version is already described by the JavaDoc.
+
+
+### Text Processing Configuration
+
+#### Proper Noun Linking 
<small>_(enhancer.engines.linking.properNounsState)_</small>
+
+This is a high level configuration option allowing users to easily specify if 
they want to do EntityLinking based on any Nouns ("Noun Linking") or only 
ProperNouns ("Proper Noun Linking").
+Configuration wise this will pre-set the defaults for the linkable 
_LexicalCategories_ and _Pos_ types.
+
+"Noun linking" is equivalent to the behavior of the 
[KeywordLinkingEngine](keywordlinkingengine) while "Proper Noun Linking" is 
similar to using NER (Named Entity Recognition) with the 
[NamedEntityLinking](namedentityextractionengine) engine. 
+
+When activating "Proper Noun Linking" users need to ensure that:
+
+1. the POS tagging for the given languages supports _Pos#ProperNoun_. If this is 
not the case for some languages then language specific configurations need to 
be used to manually adjust configurations for such languages. The next section 
provides examples for that.
+2. the Entities in the Vocabulary linked against are typically mentioned 
as Proper Nouns in the Text. Users that need to link Vocabularies with Entities 
that use common nouns as their labels (e.g. House, Mountain, Summer, ...) can 
typically not use "Proper Noun Linking", with the following exceptions:
+    * Entities with labels comprised of multiple common nouns (e.g. White 
House) can be detected in cases where _Chunk_s are supported and the _Link 
Multiple Matchable Tokens in Phrases_ option is enabled (see the next 
sub-section for details).
+    * In case Entities mentioned in the text are written as upper case tokens 
the _Upper Case Token Mode_ can be set to "LINK" (see the next sub-section 
for details).
+
+If suitable it is strongly recommended to activate "Proper Noun Linking", as it 
considerably increases performance: in typical text only around 1/10 of 
the Nouns are marked as Proper Nouns and therefore the number of vocabulary 
lookups decreases by about the same factor.
+
+#### Language Processing configuration 
<small>_(enhancer.engines.linking.processedLanguages)_</small>
+
+This parameter is used for two things: (1) to specify what languages are 
processed and (2) to provide specific configurations on how languages are 
processed. For the 2nd aspect there is also a default configuration that can be 
extended with language specific settings.
+
+__1. Processed Languages Configuration:__
+
+For the configuration of the processed languages the following syntax is used:
+
+    de
+    en
+    
+This would configure the Engine to only process German and English texts. It 
is also possible to explicitly exclude languages:
+
+    !fr
+    !it
+    *
+
+This specifies that all Languages other than French and Italian are processed 
by an EntityLinkingEngine instance.
+
+Values MUST BE passed as Array or Vector. This is done by using the 
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following 
example shows the two above examples combined into a single configuration.
+
+    
enhancer.engines.linking.processedLanguages=["!fr","!it","de","en","*"]
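+
+The include/exclude semantics can be illustrated with a small Python sketch. The function `is_processed` is hypothetical, and it assumes that an explicit exclusion ("!fr") always wins over the wildcard and that anything after a ';' in an entry is a parameter rather than part of the language code.
+
```python
def is_processed(entries, language):
    """Decide whether `language` is processed for a list like ["!fr", "de", "*"]."""
    # Keep only the language part of each entry; parameters after ';' are ignored.
    names = [entry.split(";", 1)[0].strip() for entry in entries]
    if "!" + language in names:
        # Assumption: explicitly excluded languages are never processed.
        return False
    # Explicitly listed languages and the wildcard enable processing.
    return language in names or "*" in names


combined = ["!fr", "!it", "de", "en", "*"]
only_two = ["de", "en"]
print(is_processed(combined, "fr"))
print(is_processed(combined, "es"))
print(is_processed(only_two, "es"))
```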
+
+
+__2. Language specific Parameter Configuration__
+
+In addition to specifying the processed languages this configuration can also 
be used to pass language specific parameters. The syntax for parameters is as 
follows:
+
+    {language};{param-name}={param-value};{param-name}={param-value}
+    *;{param-name}={param-value};{param-name}={param-value}
+    ;{param-name}={param-value};{param-name}={param-value}
+
+The first line sets the parameter for {language}. The 2nd and 3rd line show 
that either the wildcard language '*' or the empty language '' can be used to 
configure parameters that are used as defaults for all languages. 
+
+The following param-names are supported by the engine:
+
+__Phrase level Parameters:__
+
+* __pc__ {name}::LexicalCategory - The _Phrase Categories_ processed by the 
Engine. Valid values include the names of members of the LexicalCategory 
enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...)
+* __ptag__ {tag}::String - the _Phrase Tag_ processed by the Engine. This 
allows to configure the String tags as used by the Chunker of a Language. This 
should only be used if the Chunk types of the Chunker are not mapped to 
members of the LexicalCategory enumeration.
+* __pprob__ [0..1)::double - the _Min Phrase Tag Probability_ for Chunks to be 
accepted as processable ('value/2' is sufficient for rejecting).
+* __lmmtip__ [''/true/false]::boolean - the _Link Multiple Matchable Tokens in 
Phrases_ parameter. As the name says it allows to enable/disable the linking of 
multiple matchable tokens within the same Chunk. This is especially important 
if _Proper Noun Linking_ is active, as it allows to detect 'named entities' 
that are constituted by two common nouns. NOTE that 'lmmtip' is short for 
'lmmtip=true'
+
+__Token level Parameters:__
+
+* __lc__ {name}::LexicalCategory - The linked _Token Categories_. Valid values 
include the names of members of the LexicalCategory enumeration (e.g. "Noun", 
"Verb", "Adjective", "Adposition", ...). Typical configurations include 
"lc=Noun" or an empty list ("lc" or "lc=") to deactivate all categories and 
provide a more fine-grained Pos or Tag level configuration.
+* __pos__ {name}::Pos - The linked _Pos Types_. Valid values include the 
names of members of the Pos enumeration (e.g. "ProperNoun", "CommonNoun", 
"Infinitive", "Gerund", "PresentParticiple" and ~150 others). This parameter 
can be used to provide a very fine-grained configuration. It is e.g. used by 
the _Link ProperNouns only_ setting to define that only "pos=ProperNoun" are 
linked.
+* __tag__ {tag}::String - The linked _Pos Tags_. This parameter allows to 
configure POS tags as used by the POS tagger. This is useful if those Tags are 
not mapped to LexicalCategories or Pos types.
+* __prob__ [0..1)::double - the _Min PosTag Probability_. This parameter 
replaces the formerly used _Min POS tag probability_ 
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_ 
property. It defines the minimum confidence so that a POS annotation is 
accepted for linkable and matchable tokens ('value/2' is sufficient for 
rejecting non-linked/matched tokens).
+* __uc__ {NONE/MATCH/LINK}::string - the _Upper Case Token Mode_ allows to 
configure how upper case words are treated. There are three possible modes: (1) 
NONE: defines that they are not specially treated; (2) MATCH: defines that they 
are considered as matchable tokens (independent of the POS tag or the token 
length); (3) LINK: defines that they are in any case linked with the vocabulary. 
The default is "LINK", as upper case words often represent named entities, 
with the exception of German ('de') where the mode is set to MATCH, as all 
Nouns in German are upper case.
+
+NOTE that tokens are linked if any of "lc", "pos" or "tag" match the 
configuration. This means that adding "lc=Noun" will render "pos=ProperNoun" 
useless as the Pos type ProperNoun is already included in the LexicalCategory 
Noun.
+
+__Examples:__
+
+The default configuration for the engine uses the following 
settings:
+
+    *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
+    de;uc=MATCH
+    es;lc=Noun
+    nl;lc=Noun
+
+The first line enables _Link Multiple Matchable Tokens in Phrases_ and linking 
of upper case tokens for all languages. In addition it sets the minimum 
probabilities for Pos- and Phrase annotations to 0.75 (which would also be the 
default). The following three lines provide additional language specific 
defaults. For German the upper case mode is reset to MATCH as in German all 
Nouns use upper case. For Spanish and Dutch linking for the LexicalCategory Noun 
is enabled. This is because the OpenNLP POS tagger for those languages does not 
support ProperNouns and therefore the Engine would not link any tokens if 
_Link ProperNouns only_ is enabled. The same configuration in the OSGI 
'.config' file syntax would look as follows:
+
+    
enhancer.engines.linking.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
+
+The 2nd example shows how to define default settings without using the 
wildcard '*' that would enable processing of all languages. The following 
example shows a configuration that only enables English and German and ignores 
text in all other languages.
+
+    ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
+    en
+    de;uc=MATCH
+
+
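+To make the parameter syntax more concrete, here is a minimal Python sketch that parses such lines and merges the defaults with language specific values. Both `parse_language_config` and `effective_params` are hypothetical helpers written for this illustration; the assumption is that defaults given for '*' or the empty language are overridden by parameters given for a concrete language. Bare names such as 'lmmtip' are stored with an empty value, since their interpretation depends on the parameter.
+
```python
def parse_language_config(lines):
    """Parse lines such as 'de;uc=MATCH' into {language: {param: value}}."""
    config = {}
    for line in lines:
        language, *params = line.split(";")
        parsed = config.setdefault(language.strip(), {})
        for param in params:
            name, _, value = param.partition("=")
            # A bare name (e.g. 'lmmtip') is stored with an empty value.
            parsed[name.strip()] = value.strip()
    return config


def effective_params(config, language):
    """Merge the defaults ('*' or '') with the language specific parameters."""
    merged = {}
    for key in ("*", "", language):
        merged.update(config.get(key, {}))
    return merged


lines = [
    "*;lmmtip;uc=LINK;prob=0.75;pprob=0.75",
    "de;uc=MATCH",
    "es;lc=Noun",
    "nl;lc=Noun",
]
config = parse_language_config(lines)
german = effective_params(config, "de")
spanish = effective_params(config, "es")
print(german)
print(spanish)
```
+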
+### Entity Linker Configuration
+
+This configuration allows to configure the linking process with the controlled 
vocabulary. This includes all searching, matching as well as writing 
Enhancements for suggestions. _NOTE_ that all parameters support String 
values regardless of the data type. E.g. passing "true" is supported for 
booleans and "1.5" for floating point values.
+
+* __Label Field__ _(enhancer.engines.linking.labelField)_: The name of the 
field/property used to link (search and match) Entities. Only a single field is 
supported for performance reasons.
+* __Case Sensitivity__ _(enhancer.engines.linking.caseSensitive)_: Boolean 
switch that allows to activate/deactivate case sensitive matching. It is 
important to understand that even with case sensitivity activated an Entity 
with the label such as "Anaconda" will be suggested for the mention of 
"anaconda" in the text. The main difference will be the confidence value of 
such a suggestion as with case sensitivity activated the starting letters "A" 
and "a" are NOT considered to be matching. See the second technical part for 
details about the matching process. Case Sensitivity is deactivated by default. 
It is recommended to be activated if controlled vocabularies contain 
abbreviations similar to commonly used words e.g. CAN for Canada.
+* __Type Field__ _(enhancer.engines.linking.typeField)_: Values of this field 
are used as values of the "fise:entity-types" property of created 
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s. 
The default is "rdf:type". _NOTE_ that in contrast to the 
[NamedEntityLinking](namedentityextractionengine) the types are not used for 
the linking process. They are only used while writing the 
'fise:EntityAnnotation's and to determine the 'dc:type' values of 
'fise:TextAnnotation's.
+* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE 
enhancement structure (as used by the Stanbol Enhancer) distinguishes 
[TextAnnotation](../enhancementstructure.html#fisetextannotation)s and 
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The 
EntityLinkingEngine needs to create both types of Annotations: 
TextAnnotations selecting the words that match some Entities in the Controlled 
Vocabulary and EntityAnnotations that represent an Entity suggested for a 
TextAnnotation. The Type Mappings are used to determine the "dc:type" of the 
TextAnnotation based on the types of the suggested Entity. The default 
configuration comes with mappings for Persons, Organizations, Places and 
Concepts but this field allows to define additional mappings. For details 
about the syntax see the sub-section "Type Mappings Syntax" below.
+* __Redirect Field__ _(enhancer.engines.linking.redirectField)_ and __Redirect 
Mode__ _(enhancer.engines.linking.redirectMode)_: Redirects allow to follow 
links to other entities defined in the vocabulary linked against. This is 
useful in cases where matched Entities are not equal to the Entities that 
users want to suggest. A good example is [DBpedia](http://dbpedia.org) where 
the Entity 'dbpedia:USA' defines only the label "USA" and a redirect to the 
Entity 'dbpedia:United_States' with all the information. The _Redirect Mode_ 
defines how redirects are handled: "IGNORE" does not follow redirects; 
"ADD_VALUES" causes 
information of the redirected entity ('dbpedia:United_States') to be added to 
the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected Entity 
('dbpedia:United_States') instead of the matched one ('dbpedia:USA'). The 
_Redirect Field_ defines the field/property used for redirects.
+* __Suggestions__ _(enhancer.engines.linking.suggestions)_: The maximum number 
of suggestions. The default value for this is '3'. If the engine is used in 
combination with a post processing engine (e.g. disambiguation) then users 
might want to increase this value.
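+
+The three _Redirect Mode_ values described in the list above can be sketched as follows. The dictionary based entity representation, the `apply_redirect` function and the example field values are assumptions made for this illustration only; the engine itself works on _Representation_ objects.
+
```python
def apply_redirect(matched, target, mode):
    """Return the entity to suggest for a matched entity that redirects to `target`."""
    if mode == "IGNORE":
        return matched
    if mode == "FOLLOW":
        return target
    if mode == "ADD_VALUES":
        # Copy the matched entity and add the values of the redirected one.
        fields = {name: list(values) for name, values in matched["fields"].items()}
        for name, values in target["fields"].items():
            merged = fields.setdefault(name, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        return {"id": matched["id"], "fields": fields}
    raise ValueError(f"unknown redirect mode: {mode}")


# Hypothetical example data modelled on the DBpedia case described above.
usa = {"id": "dbpedia:USA", "fields": {"rdfs:label": ["USA"]}}
united_states = {
    "id": "dbpedia:United_States",
    "fields": {"rdfs:label": ["United States"], "rdf:type": ["dbp-ont:Place"]},
}

ignored = apply_redirect(usa, united_states, "IGNORE")
added = apply_redirect(usa, united_states, "ADD_VALUES")
followed = apply_redirect(usa, united_states, "FOLLOW")
print(ignored["id"], added["fields"], followed["id"])
```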
+
+The following properties define how Linkable and Matchable Tokens are linked 
against the Entities of the linked vocabulary
+
+* __Default Matching Language__ 
_(enhancer.engines.linking.defaultMatchingLanguage)_: Linking is always done in 
the language of the processed text and in the _Default Matching Language_. By 
default the default matching language covers labels without a language tag, but 
this parameter allows to override this with a specific language. This is e.g. useful 
for [DBpedia](http://dbpedia.org) where all labels are marked with the language 
of the source Wikipedia data. So it makes sense to configure the default 
matching language to this value.
+* __Max Search Token Distance__ 
_(enhancer.engines.linking.maxSearchTokenDistance)_: The maximum number of 
Tokens searched around a linked token for additional matchable tokens 
to be included in searches for Entities. The default value is '3'. As an 
example, in the text section "at the University of Munich a new procedure to" 
only "Munich" would be marked as linkable token if _Proper Noun Linking_ is 
activated. However for searching Entities it makes sense to also use the 
matchable term 'University', because otherwise a search would potentially 
return a huge number of candidate Entities mentioning 'Munich' in their 
labels. This parameter allows to configure the maximum distance of tokens so 
that the EntityLinkingEngine may include them as additional optional 
constraints for queries via the EntitySearcher interface. _NOTE_ that this 
parameter will not allow to include tokens outside of a _processable chunk_ if 
the _linked token_ is within one.
+* __Max Search Tokens__ _(enhancer.engines.linking.maxSearchTokens)_: The 
maximum number of Tokens used for searches via the _EntitySearcher_ interface. 
The default value is '2'. In case more _matchable tokens_ are within the 
configured _Max Search Token Distance_ then those closer to and trailing the 
_linkable token_ are preferred. E.g. the text "president Barack Obama" where 
'Barack' is the currently active _linkable token_ will result in a query with 
the tokens 'Barack' OR 'Obama' if _Max Search Tokens_=2 and _Max Search Token 
Distance_>=1 because both 'president' and 'Obama' have a distance of 1 but 
trailing Tokens are preferred. 
+* __Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_: If this 
feature is enabled then the _MorphoFeatures#getLemma()_ values are used instead 
of the _Token#getSpan()_s if present.
+* __Min Search Token Length__ 
_(enhancer.engines.linking.minSearchTokenLength)_: This is used as fallback if 
the _Tokens_ in the _[AnalyzedText](../nlp/analyzedtext)_ do not contain Part 
of Speech annotations or if the confidence of those annotations is too low. The 
default value is '3' meaning that in such cases all tokens with more than '3' 
characters are linked with the vocabulary. _NOTE_ that this configuration might 
move to the _Text Processing Configuration_ in future versions.
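+
+The interplay of _Max Search Tokens_ and _Max Search Token Distance_ described in the list above can be sketched in Python. The function `select_search_tokens` is a hypothetical simplification: it assumes that candidates are ranked by distance first and that trailing tokens win ties, and it ignores the restriction to tokens within the same phrase.
+
```python
def select_search_tokens(tokens, active, max_search_tokens, max_distance):
    """Select the tokens used for a search around the active linkable token.

    tokens: list of (text, searchable) pairs, where searchable marks
            linkable or matchable tokens
    active: index of the linkable token the search is issued for
    """
    candidates = []
    for index, (text, searchable) in enumerate(tokens):
        distance = abs(index - active)
        if index != active and searchable and distance <= max_distance:
            # Sort key: closer tokens first, trailing tokens before preceding ones.
            candidates.append((distance, index < active, text))
    candidates.sort()
    extra = max(0, max_search_tokens - 1)
    return [tokens[active][0]] + [text for _, _, text in candidates[:extra]]


phrase = [("president", True), ("Barack", True), ("Obama", True)]
sentence = [
    ("at", False), ("the", False), ("University", True), ("of", False),
    ("Munich", True), ("a", False), ("new", False), ("procedure", True),
    ("to", False),
]
obama_query = select_search_tokens(phrase, 1, 2, 1)
munich_query = select_search_tokens(sentence, 4, 2, 3)
print(obama_query)
print(munich_query)
```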
+
+The parameters below are used to configure the matching process.
+
+* __Minimum Token Match Score__ _(enhancer.engines.linking.minTokenScore)_: 
This defines how well single tokens of the text need to match single tokens in 
the label so that they are considered as matching. This parameter configures 
the lower limit. However the actual token match score does also influence the 
overall matching scores for labels with the text. So non exact matches will 
decrease matching scores for the whole label with the text.
+* __Min Label Match Score__ 
_org.apache.stanbol.enhancer.engines.keywordextraction.minLabelMatchFactor_ 
[0..1]::double: The "Label Score" [0..1] represents how much of the Label of an 
Entity matches with the Text. It compares the number of Tokens of the Label 
with the number of Tokens matched to the Text. Non-exact matches for Tokens, or 
Tokens within the label that appear in a different order than in the text, 
also reduce this score. Entities are only considered if at least one of their 
labels scores higher than the minimum for all three of _Min Label Match Score_, 
_Min Text Match Score_ and _Min Match Score_.
+* __Min Matched Tokens__ 
_org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens_ 
[1..*]::int: The minimum number of matching tokens. Only "matchable" tokens are 
counted. For full matches (where all tokens of the Label match tokens in the 
text) this parameter is ignored.
+
+    This parameter is strongly related to the _Min Label Match Score_. 
Typical settings are:
+
+    1. _Min Matched Tokens_=1 and _Min Label Match Score_ > 0.5 (e.g. 0.75)
+    2. _Min Matched Tokens_=2 and _Min Label Match Score_ <= 0.5 (e.g. 0.5)
+
+    For Labels consisting of one or two words both options have the same 
result, but for longer labels (1) is more restrictive than (2). The important 
thing is that both options ensure that Labels with more than one token will 
not be considered if only a single token matches the text.
+
+    If used in combination with a disambiguation Engine one might want to 
suggest Entities where only a single token of a multi-token label 
matches. In such cases a configuration like _Min Matched Tokens_=1 and _Min Label 
Match Score_ <= 0.5 (e.g. 0.4) might be considered. In such scenarios users 
will also want to considerably increase the value for _Max Suggestions_ 
(typically values > 10).
+* __Min Text Match Score__ 
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_ 
[0..1]::double: The "Text Score" [0..1] represents how well the Label of an 
Entity matches the selected Span in the Text. It compares the number of 
matched Tokens from the label with the number of Tokens enclosed by the 
Span in the Text an Entity is suggested for. Non-exact matches for Tokens, or 
Tokens within the label that appear in a different order than in the text, 
also reduce this score. Entities are only considered if at least one of their 
labels scores higher than the minimum for all three of _Min Label Match Score_, 
_Min Text Match Score_ and _Min Match Score_.
+* __Min Match Score__ 
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_ 
[0..1]::double: Defined as the product of the "Text Score" with the "Label 
Score", meaning that this value represents both how well the label matches the 
text and how much of the label is matched with the text. Entities are only 
considered if at least one of their labels scores higher than the minimum for 
all three of _Min Label Match Score_, _Min Text Match Score_ and _Min Match 
Score_. 
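+
+The relation between _Min Matched Tokens_ and the three scores can be checked with a simplified Python model. This sketch is an assumption, not the engine's implementation: it computes the scores purely from token counts (ignoring non-exact matches and token order), treats a score equal to the minimum as sufficient, and uses the hypothetical function names `scores` and `is_suggested`.
+
```python
def scores(label_tokens, span_tokens, matched):
    """Return (label score, text score, match score) from plain token counts."""
    label_score = matched / label_tokens
    text_score = matched / span_tokens
    return label_score, text_score, label_score * text_score


def is_suggested(label_tokens, span_tokens, matched, min_matched_tokens,
                 min_label_score, min_text_score=0.0, min_match_score=0.0):
    """Apply the minimum settings to a single label match."""
    full_match = matched == label_tokens
    # Min Matched Tokens is ignored for full matches.
    if not full_match and matched < min_matched_tokens:
        return False
    label_score, text_score, match_score = scores(label_tokens, span_tokens, matched)
    return (label_score >= min_label_score
            and text_score >= min_text_score
            and match_score >= min_match_score)


# "Barack Obama" (2 label tokens) matched against the single token "Obama".
setting_1 = is_suggested(2, 1, 1, min_matched_tokens=1, min_label_score=0.75)
setting_2 = is_suggested(2, 1, 1, min_matched_tokens=2, min_label_score=0.5)
relaxed = is_suggested(2, 1, 1, min_matched_tokens=1, min_label_score=0.4)
# A three token label with two matched tokens shows where (1) and (2) differ.
long_1 = is_suggested(3, 2, 2, min_matched_tokens=1, min_label_score=0.75)
long_2 = is_suggested(3, 2, 2, min_matched_tokens=2, min_label_score=0.5)
print(setting_1, setting_2, relaxed, long_1, long_2)
```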
+
+#### Type Mappings Syntax
+
+The Type Mappings are used to determine the "dc:type" of the 
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the 
types of the suggested Entity. The field "Type Mappings" (property: 
_org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be 
used to customize such mappings.
+
+This field uses the following syntax
+
+    {uri}
+    {source} > {target}
+    {source1}; {source2}; ... {sourceN} > {target}
+
+The first variant is a shorthand for {uri} > {uri} and therefore specifies 
that the {uri} should be used as 'dc:type' for 
[TextAnnotation](../enhancementstructure.html#fisetextannotation)s if the 
matched entity is of type {uri}. The second variant matches a {source} URI to a 
{target}. Variant three shows the possibility to match multiple URIs to the 
same target in a single configuration line.
+
+Both 'ns:localName' and fully qualified URIs are supported. For supported 
namespaces see the 
[NamespaceEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/generic/servicesapi/src/main/java/org/apache/stanbol/entityhub/servicesapi/defaults/NamespaceEnum.java).
 Information about accepted (INFO) and ignored (WARN) type mappings is 
available in the logs.
+
+Some Examples of additional Mappings for the e-health domain:
+
+    drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine > 
drugbank:drugs
+    diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases 
+    sider:side_effects
+    dailymed:ingredients
+    dailymed:organization > dbp-ont:Organisation
+
+The first two lines map some well-known classes that represent drugs and 
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth 
lines define 1:1 mappings for side effects and ingredients, and the last line 
adds 'dailymed:organization' as an additional mapping to the DBpedia Ontology 
class Organisation.
+
+The following mappings are predefined by the KeywordLinkingEngine.
+
+    dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
+    dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization > dbp-ont:Organisation
+    dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
+    skos:Concept
+
+## Extension Points
+
+This section describes the interfaces that are used as extension points by the 
KeywordLinkingEngine.
+
+### EntitySearcher
+
+The EntitySearcher interface is used by the EntityLinkingEngine to search for 
Entities in the linked Vocabulary. This interface supports two main 
functionalities:
+
+__Dereference Entities__ _get(String id,Set<String> 
includeFields)::Representation_
+
+This method is called with the 'id' of an Entity and needs to return the data 
of the Entity as a _Representation_. The returned _Representation_ needs to 
include at least the fields passed in 'includeFields'. If 'includeFields' is empty or null 
then all information for the Entity should be included in the returned 
_Representation_.
+
+__Entity Search__ _lookup(String field, Set<String> includeFields, 
List<String> search, String[] languages, Integer 
limit)::Collection<Representation>_
+
+This method is used for searching entities in the controlled vocabulary. The 
configured _Label Field_ is passed in the 'field' parameter. The 
'includeFields' parameter contains all fields required for the linking process. 
_Representation_s returned as results need to include values for those fields. 
The 'search' parameter includes the tokens used for the search. Each token should 
be treated as optional; however, results are expected to rank Entities that 
match more search tokens first.
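+
+The two operations can be illustrated with an in-memory stand-in. This sketch does not implement the real Stanbol interface: the class name is hypothetical, a plain Map of field names to values takes the place of _Representation_, labels are split on whitespace, and the 'languages' parameter is ignored because the stand-in data carries no language tags.
+
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory searcher; a Map stands in for a Stanbol Representation. */
final class InMemorySearcherSketch {

    // entity id -> (field -> values)
    private final Map<String, Map<String, List<String>>> entities = new LinkedHashMap<>();

    void add(String id, Map<String, List<String>> data) {
        entities.put(id, data);
    }

    /** Dereference: returns the requested fields, or all fields if none are requested. */
    Map<String, List<String>> get(String id, Set<String> includeFields) {
        Map<String, List<String>> data = entities.get(id);
        if (data == null) {
            return null;
        }
        if (includeFields == null || includeFields.isEmpty()) {
            return new HashMap<>(data);
        }
        Map<String, List<String>> result = new HashMap<>();
        for (String field : includeFields) {
            if (data.containsKey(field)) {
                result.put(field, data.get(field));
            }
        }
        return result;
    }

    /** Search: every token is optional; entities matching more tokens come first. */
    Collection<Map<String, List<String>>> lookup(String field, Set<String> includeFields,
            List<String> search, String[] languages, Integer limit) {
        final Map<String, Integer> hits = new HashMap<>();
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Map<String, List<String>>> entry : entities.entrySet()) {
            int count = countMatches(entry.getValue().get(field), search);
            if (count > 0) {
                hits.put(entry.getKey(), count);
                ids.add(entry.getKey());
            }
        }
        ids.sort((a, b) -> hits.get(b) - hits.get(a)); // most matched tokens first
        List<Map<String, List<String>>> results = new ArrayList<>();
        for (String id : ids) {
            if (limit != null && results.size() >= limit) {
                break;
            }
            results.add(get(id, includeFields));
        }
        return results;
    }

    /** Number of search tokens found in the best matching label (case-insensitive). */
    private static int countMatches(List<String> labels, List<String> search) {
        int best = 0;
        if (labels == null) {
            return best;
        }
        for (String label : labels) {
            List<String> words = Arrays.asList(label.toLowerCase(Locale.ROOT).split("\\s+"));
            int count = 0;
            for (String token : search) {
                if (words.contains(token.toLowerCase(Locale.ROOT))) {
                    count++;
                }
            }
            best = Math.max(best, count);
        }
        return best;
    }
}
```
+
+Searching the label field for the tokens "university" and "munich" would return an entity labelled "University of Munich" ahead of one labelled "Munich", because the former matches more of the search tokens.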
+
+
+Currently the Stanbol Entityhub based implementations are instantiated based on 
the value of the 
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_ property. 
Users who want to link against a different implementation of this interface 
need to extend the KeywordLinkingEngine and override the 
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object> 
configuration) and #deactivateEntitySearcher() methods. Those methods are called during 
activation/deactivation of the KeywordLinkingEngine and are expected to 
set/unset the #entitySearcher field.
+
+### LabelTokenizer
+
+The LabelTokenizer interface is used to tokenize labels of Entities from the 
linked Vocabulary. As the matching process of the KeywordLinkingEngine is based 
on Tokens (words), multi-word labels (e.g. University of Munich) need to be 
tokenized before they can be matched against the current context in the Text.
+
+LabelTokenizers are OSGI services. Their configuration can optionally define 
the _'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property. 
Its values are treated as language configurations that can 
explicitly include or exclude languages. A wildcard is also supported (e.g. 
"en,de" would include English and German; "!it,!fr,*" would specify all 
languages except Italian and French). If no configuration is provided then "*" 
(all languages) is assumed.
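+
+The include/exclude/wildcard rules above can be sketched as follows. This is an illustration under stated assumptions, not Stanbol's own parser: the class and method names are hypothetical, and language codes are compared case-insensitively.
+
```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical parser for values such as "en,de" or "!it,!fr,*". */
final class LanguageConfigSketch {

    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private boolean wildcard;

    LanguageConfigSketch(String config) {
        if (config == null || config.trim().isEmpty()) {
            wildcard = true; // no configuration: "*" (all languages) is assumed
            return;
        }
        for (String entry : config.split(",")) {
            String lang = entry.trim().toLowerCase(Locale.ROOT);
            if (lang.isEmpty()) {
                continue;
            }
            if (lang.equals("*")) {
                wildcard = true;
            } else if (lang.startsWith("!")) {
                excluded.add(lang.substring(1));
            } else {
                included.add(lang);
            }
        }
    }

    /** Explicit exclusions win; otherwise the wildcard or an explicit include is required. */
    boolean accepts(String language) {
        String lang = language.toLowerCase(Locale.ROOT);
        if (excluded.contains(lang)) {
            return false;
        }
        return wildcard || included.contains(lang);
    }
}
```
+
+With the configuration "!it,!fr,*" this sketch accepts "en" and rejects "it"; with "en,de" it accepts only those two languages.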
+
+The KeywordLinkingEngine will - by default - always use the LabelTokenizer 
with the highest "service.ranking" for a given language to tokenize labels. By 
default it comes with an OpenNLP based Tokenizer implementation that registers 
itself for all languages with a "service.ranking" of "-1000".
+
+Users that want to use a different Tokenizer need to register an 
implementation for the given language(s) with a higher "service.ranking". 
Users that want to provide their own LabelTokenizer and ignore the values 
provided by OSGI need to extend the KeywordLinkingEngine, set the 
#labelTokenizer field themselves AND override the 
#bindLabelTokenizer(LabelTokenizerManager ltm) and 
#unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do 
NOT change the #labelTokenizer field.
