Author: rwesten
Date: Fri Nov 23 11:16:11 2012
New Revision: 1412829
URL: http://svn.apache.org/viewvc?rev=1412829&view=rev
Log:
initial documentation for STANBOL-733
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1412829&view=auto
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
(added)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Fri Nov 23 11:16:11 2012
@@ -0,0 +1,277 @@
+Title: EntityLinkingEngine
+
+The EntityLinkingEngine is an Engine that consumes NLP processing results
from the [AnalyzedText](../nlp/analyzedtext) content part and uses this
information to link (search and match) entities from a configured vocabulary.
+
+To do so it uses the following configurations and components:
+
+* __Text Processing Configuration__: This configures how the
EntityLinkingEngine consumes NLP processing results. Such configurations can be
language specific.
+* __Entity Linking Configuration__: This configures various properties that
are used for the linking process with the vocabulary
+* __EntitySearcher__: This interface is used to search and dereference
Entities. It needs to be implemented to use a datasource for linking with the
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub
(see [EntityhubLinkingEngine](entityhublinkingengine))
+* __LabelTokenizer__: While the processed text is already tokenized, the
Entity labels are not. To match labels with the text the EntityLinkingEngine
therefore needs to tokenize those labels. Apache Stanbol provides a default
implementation of this interface based on the
[OpenNLP](http://opennlp.apache.org) tokenizer API.
+
+The EntityLinkingEngine cannot be used directly, as the four things listed
above need to be passed to its constructor. It is instead intended to be
configured/extended by other components. The
[EntityhubLinkingEngine](entityhublinkingengine) is one of them; it configures
the EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.
+
+This documentation first describes the implemented entity linking process,
then provides information about the supported configuration parameters of the
_Text Processing Configuration_ and the _Entity Linking Configuration_. The
last part describes how to extend the EntityLinkingEngine by
implementing/providing a custom _EntitySearcher_ and _LabelTokenizer_.
+
+## Linking Process:
+
+The Linking Process consists of three major steps. First, it consumes the
results of the NLP processing to determine the tokens (words) that need to be
linked with the configured vocabulary. Second, it links entities, based on
their labels, with the current section of the text. Third, it writes the
enhancement results.
+
+
+### Token Types
+
+The EntityLinkingEngine operates on tokens (words). Those tokens are divided
into the following categories:
+
+* __Linkable Tokens__: These are words that are linked with the vocabulary.
This means that the engine will issue queries against the controlled
vocabulary for those tokens.
+* __Matchable Tokens__: Matchable tokens are used to refine queries. For the
matching of entity labels with the text those words are treated in the same way
as linkable words. The main difference is that matchable words alone will not
cause the engine to query for Entities in the controlled vocabulary.
+* __Other Tokens__: All other tokens in the text are not used for searches in
the configured vocabulary. However, they are considered during the matching of
labels with the text, as they might also be present in labels of entities.
+
+"University of Salzburg" is a good example: 'University', a common noun, can
be considered a matchable token, 'of' an other token, and 'Salzburg', as a
proper noun, is a typical linkable token. As the engine only queries for
linkable tokens, a single query for 'Salzburg' would be issued against the
vocabulary. However, this query would also use the matchable token 'University'
as a secondary query term. The token 'of' would only be considered during
matching.
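+
+The classification in this example can be sketched in a few lines of Python.
This is only an illustration: the `Token` class, the POS labels and the two
sets are assumptions made for the sketch, not part of the Stanbol API, and the
real mapping comes from the engine configuration.
+

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str  # hypothetical POS label, e.g. "ProperNoun"


# Assumed mapping for "Proper Noun Linking"; the real mapping is provided
# by the engine configuration and can differ per language.
LINKABLE_POS = {"ProperNoun"}
MATCHABLE_POS = {"ProperNoun", "CommonNoun"}


def token_type(token: Token) -> str:
    """Classify a token as linkable, matchable or other."""
    if token.pos in LINKABLE_POS:
        return "linkable"
    if token.pos in MATCHABLE_POS:
        return "matchable"
    return "other"


tokens = [
    Token("University", "CommonNoun"),
    Token("of", "Adposition"),
    Token("Salzburg", "ProperNoun"),
]
types = {t.text: token_type(t) for t in tokens}
# Only linkable tokens trigger a query; matchable tokens merely refine it.
query_triggers = [t.text for t in tokens if token_type(t) == "linkable"]
```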
+
+In addition to the token type the engine also determines the following
parameters:
+
+* __Token Length__: The number of characters of a word. This is especially
important for languages where no POS tagger is available.
+* __Alpha-Numeric__: Whether a Token contains an alphabetic or numeric
character. This is mainly used to skip processing of tokens that represent
punctuation.
+* __Upper Case__: Upper case Tokens often represent named entities. Because
of that the Engine keeps track of upper case words.
+* __Token Phrase__: Whether a Token is a member of a _processable_ Phrase.
Phrases are groups of Tokens that can be detected by a Chunker. Typical
examples are Noun Phrases.
+
+
+### Consumed NLP Processing Results:
+
+The EntityLinkingEngine consumes NLP processing results from the AnalyzedText
ContentPart of the processed ContentItem. The following list describes the
consumed information and its usage in the linking process:
+
+1. __Language__ _(required)_: The Language of the Text is acquired from the
Metadata of the ContentItem. It is required to search for labels in the correct
language and also to correctly apply language specific configurations of the
engine.
+2. __Sentences__ _(optional)_: Sentence annotations are used as segments for
the matching process. In addition, the _Upper Case_ feature is NOT set for the
first word of a Sentence. If no Sentence annotations are present the whole
text is treated as a single Sentence.
+3. __Tokens__ _(required)_: As this Engine is based on the processing of
Tokens, this information is absolutely required.
+4. __POS Annotations__ _(optional)_: Part of Speech (POS) tags are used to
determine the _Token Type_. The NLP processing module provides two enumerations
that define POS types: the high level _Lexical Categories_ (16 members
including "Noun", "Verb", "Adjective", "Adposition", ...) and the Pos
enumeration with ~150 very detailed POS definitions (e.g. "ProperNoun",
"CommonNoun", "Infinitive", "Gerund", "PresentParticiple", ...). In addition
the engine can also be configured to use the string tags as used by the POS
tagger. The mapping of the _POS Annotation_ to the _Token Type_ is provided by
the Engine configuration and can be language specific.
+5. __Phrase Annotation__ _(optional)_: Phrase Annotations of Chunks present in
the AnalyzedText are checked against the configured processable phrase
categories. The linking of Tokens is NOT limited to Tokens within processable
phrases. Phrases are only used as additional context to improve the matching
process. The _Lexical Category_ and the string tags used by the Chunker can be
used to configure the processable Phrase categories.
+6. __Lemma__ _(optional)_: The Lemma provided by the MorphoAnalysis annotation
can be used for linking instead of the token as used within the text.
+
+
+### Entity Linking:
+
+The linking process is based on matching the labels of entities returned as
results of searches in the configured controlled vocabulary. In addition the
engine can be configured to consider redirects for entities returned by
searches.
+
+Searches are issued only for _Linkable Tokens_ and may include up to _Max
Search Tokens_ additional _Linkable-_ or _Matchable Tokens_. If the _Linkable
Token_ is within a _Phrase_ then only other tokens within the same phrase are
considered. Otherwise any _Linkable-_ or _Matchable Token_ within the
configured _Max Search Token Distance_ is considered for the search.
+
+Searches in the controlled vocabulary are issued using the _EntitySearcher_
interface and are built as follows:
+
+ {lt}@{lang} || {lt}@{dl} || [{at}@{lang} || {at}@{dl} ... ]
+
+where:
+
+ * {lt} ... the _Linkable Token_ for which the search is issued
+ * {at} ... additional _Linkable-_ or _Matchable Tokens_ included in the
search
+ * {lang} ... the language of the text
+ * {dl} ... the configured _Default Matching Language_. If {dl} == {lang}
then the or term(s) for the {dl} are omitted
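+
+As an illustration, the query pattern above can be assembled as in the
following Python sketch. The function names `build_query_terms` and `render`
are hypothetical; the actual query is built inside the engine and passed to the
_EntitySearcher_.
+

```python
def build_query_terms(linkable, additional, lang, default_lang):
    """Return the OR-ed search terms as (token, language) pairs.

    The default matching language is only added when it differs from
    the language of the text.
    """
    languages = [lang] if default_lang == lang else [lang, default_lang]
    return [
        (token, language)
        for token in [linkable, *additional]
        for language in languages
    ]


def render(terms):
    """Render the terms in the '{token}@{language} || ...' notation."""
    return " || ".join(f"{token}@{language}" for token, language in terms)


# German text, default matching language English
query = render(build_query_terms("Salzburg", ["University"], "de", "en"))
# English text, default matching language also English: no duplicate terms
query_same = render(build_query_terms("Salzburg", ["University"], "en", "en"))
```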
+
+For results of those queries the labels in the {lang} and {dl} are matched
against the text. However, {dl} labels are only considered if no match was
found for labels in the language of the text. To match labels with the Tokens
of the text the engine needs to tokenize the labels. This is done by using the
_LabelTokenizer_ interface.
+
+The matching process distinguishes between matchable and non-matchable Tokens
as well as non-alpha-numeric Tokens, which are completely ignored. Matching
starts at the position of the _Linkable Token_ for which the search in the
configured vocabulary was issued. From this position Tokens in the Label are
matched with Tokens in the text until the first matchable or the 2nd
non-matchable token is not found. In a second round the same is done in the
backward direction. The configured _Min Token Match Factor_ determines how
exactly tokens in the text must correspond to tokens in the label so that a
match is considered. This is repeated for all labels of an Entity. The label
match that covers the most tokens is then considered as the match for that
Entity.
+
+There are various parameters that can be used to fine tune the matching
process. The most important decision is whether to include suggestions
where labels with two tokens match only a single _Matchable Token_ in the
text (e.g. "Barack Obama" matching "Obama", but also 1000+ "Tom {something}"
matching "Tom"). The default configuration of the Engine excludes those, but
depending on the use case and the linked vocabulary users might want to change
this. See the documentation of _Min Matched Tokens_ and _Min Label Match
Score_ for details and examples.
+
+
+### Writing Enhancement Results
+
+This step covers the following tasks:
+
+* processing of redirects as configured by the _Redirect Mode_
+* mapping of the Entity types to the dc:type values for fise:TextAnnotations
as configured by the _Type Mappings_ configuration
+* if _Dereference Entities_ is enabled, obtaining the information for all
configured _Dereferenced Fields_
+* writing of the fise:TextAnnotations, fise:EntityAnnotations and dereferenced
entities (if enabled) to the metadata of the processed ContentItem
+
+
+## Configurations
+
+The EntityLinkingEngine is configured by passing a _TextProcessingConfig_
and an _EntityLinkingConfig_ to its constructor. Both configuration classes
provide an API based configuration (via getters and setters) as well as an
OSGI Dictionary based configuration (via a static method that configures a new
instance from a passed configuration).
+
+The following two sections describe the "key, value" based configuration, as
the API based version is described by the JavaDoc.
+
+
+### Text Processing Configuration
+
+#### Proper Noun Linking
<small>_(enhancer.engines.linking.properNounsState)_</small>
+
+This is a high level configuration option allowing users to easily specify
whether they want to do EntityLinking based on any Nouns ("Noun Linking") or
only on ProperNouns ("Proper Noun Linking").
+Configuration wise this will pre-set the defaults for the linkable
_LexicalCategories_ and _Pos_ types.
+
+"Noun linking" is equivalent to the behavior of the
[KeywordLinkingEngine](keywordlinkingengine) while "Proper Noun Linking" is
similar to using NER (Named Entity Recognition) with the
[NamedEntityLinking](namedentityextractionengine) engine.
+
+When activating "Proper Noun Linking" users need to ensure that:
+
+1. the POS tagging for the given languages supports _Pos#ProperNoun_. If this
is not the case for some languages then language specific configurations need
to be used to manually adjust the configuration for such languages. The next
section provides examples for that.
+2. the Entities in the Vocabulary linked against are typically mentioned as
Proper Nouns in the Text. Users that need to link Vocabularies with Entities
that use common nouns as their labels (e.g. House, Mountain, Summer, ...) can
typically not use "Proper Noun Linking", with the following exceptions:
+ * Entities with labels comprised of multiple common nouns (e.g. White
House) can be detected in cases where _Chunk_s are supported and the _Link
Multiple Matchable Tokens in Phrases_ option is enabled (see the next
sub-section for details).
+ * In case Entities mentioned in the text are written as upper case tokens,
the _Upper Case Token Mode_ can be set to "LINK" (see the next sub-section
for details).
+
+If suitable, it is strongly recommended to activate "Proper Noun Linking", as
it considerably improves performance: in typical text only around 1/10 of the
Nouns are marked as Proper Nouns, and the number of vocabulary lookups
decreases accordingly.
+
+#### Language Processing configuration
<small>_(enhancer.engines.linking.processedLanguages)_</small>
+
+This parameter is used for two things: (1) to specify what languages are
processed and (2) to provide specific configurations on how languages are
processed. For the 2nd aspect there is also a default configuration that can be
extended with language specific settings.
+
+__1. Processed Languages Configuration:__
+
+For the configuration of the processed languages the following syntax is used:
+
+ de
+ en
+
+This would configure the Engine to only process German and English texts. It
is also possible to explicitly exclude languages:
+
+ !fr
+ !it
+ *
+
+This specifies that all Languages other than French and Italian are processed
by an EntityLinkingEngine instance.
+
+Values MUST BE passed as Array or Vector. This is done by using the
["elem1","elem2",...] syntax as defined by OSGI ".config" files. The following
example shows the two above examples combined into a single configuration.
+
+
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["!fr","!it","de","en","*"]
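+
+How such a list of includes, excludes and the wildcard could be evaluated is
sketched below in Python. The function `is_language_processed` is hypothetical,
and the rule that an explicit exclusion always wins over the wildcard is an
assumption made for this sketch.
+

```python
def is_language_processed(config, lang):
    """Evaluate a processed-languages list such as ["!fr", "de", "*"].

    Only the part before the first ';' of each entry is looked at, so
    entries carrying language specific parameters are handled as well.
    """
    entries = [entry.split(";", 1)[0].strip() for entry in config]
    if "!" + lang in entries:
        # Assumption: an explicit exclusion overrides the wildcard.
        return False
    return lang in entries or "*" in entries


combined = ["!fr", "!it", "de", "en", "*"]
only_de_en = ["de", "en"]
```
+
+With the combined configuration every language except French and Italian is
accepted; with `only_de_en` a Spanish text would be skipped.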
+
+
+__2. Language specific Parameter Configuration__
+
+In addition to specifying the processed languages this configuration can also
be used to pass language specific parameters. The syntax for parameters is as
follows:
+
+ {language};{param-name}={param-value};{param-name}={param-value}
+ *;{param-name}={param-value};{param-name}={param-value}
+ ;{param-name}={param-value};{param-name}={param-value}
+
+The first line sets the parameters for {language}. The 2nd and 3rd lines show
that either the wildcard language '*' or the empty language '' can be used to
configure parameters that are used as defaults for all languages.
+
+The following param-names are supported by the EntityLinkingEngine:
+
+__Phrase level Parameters:__
+
+* __pc__ {name}::LexicalCategory - The _Phrase Categories_ processed by the
Engine. Valid values include the names of members of the LexicalCategory
enumeration (e.g. "Noun", "Verb", "Adjective", "Adposition", ...)
+* __ptag__ {tag}::String - the _Phrase Tag_ processed by the Engine. This
allows configuring the String tags as used by the Chunker of a Language. This
should only be used if the Chunk types of the Chunker are not mapped to
members of the LexicalCategory enumeration.
+* __pprob__ [0..1)::double - the _Min Phrase Tag Probability_ for Chunks to be
accepted as processable ('value/2' is sufficient for rejecting).
+* __lmmtip__ [''/true/false]::boolean - the _Link Multiple Matchable Tokens in
Phrases_ parameter. As the name says, it allows enabling/disabling the linking
of multiple matchable tokens within the same Chunk. This is especially
important if _Proper Noun Linking_ is active, as it allows detecting 'named
entities' that consist of two common nouns. NOTE that 'lmmtip' is short for
'lmmtip=true'
+
+__Token level Parameters:__
+
+* __lc__ {name}::LexicalCategory - The linked _Token Categories_. Valid values
include the names of members of the LexicalCategory enumeration (e.g. "Noun",
"Verb", "Adjective", "Adposition", ...). Typical configurations include
"lc=Noun" or an empty list ("lc" or "lc=") to deactivate all categories and
provide a more fine granular Pos or Tag level configuration.
+* __pos__ {name}::Pos - The linked _Pos Types_. Valid values include the
names of members of the Pos enumeration (e.g. "ProperNoun", "CommonNoun",
"Infinitive", "Gerund", "PresentParticiple" and ~150 others). This parameter
can be used to provide a very fine granular configuration. It is e.g. used by
the _Link ProperNouns only_ setting to define that only "pos=ProperNoun" are
linked.
+* __tag__ {tag}::String - The linked _Pos Tags_. This parameter allows
configuring POS tags as used by the POS tagger. This is useful if those Tags
are not mapped to LexicalCategories or Pos types.
+* __prob__ [0..1)::double - the _Min PosTag Probability_. This parameter
replaces the formerly used _Min POS tag probability_
_(org.apache.stanbol.enhancer.engines.keywordextraction.minPosTagProbability)_
property. It defines the minimum confidence so that a POS annotation is
accepted for linkable and matchable tokens ('value/2' is sufficient for
rejecting non linked/matched tokens).
+* __uc__ {NONE/MATCH/LINK}::string - the _Upper Case Token Mode_ allows
configuring how upper case words are treated. There are three possible modes:
(1) NONE: defines that they are not specially treated; (2) MATCH: defines that
they are considered as matchable tokens (independent of the POS tag or the
token length); (3) LINK: defines that they are in any case linked with the
vocabulary. The default is "LINK", as upper case words often represent named
entities, with the exception of German ('de') where the mode is set to MATCH,
as all Nouns in German are upper case.
+
+NOTE that tokens are linked if any of "lc", "pos" or "tag" match the
configuration. This means that adding "lc=Noun" will render "pos=ProperNoun"
useless, as the Pos type ProperNoun is already included in the LexicalCategory
Noun.
+
+__Examples:__
+
+The default configuration for the EntityLinkingEngine uses the following
settings:
+
+ *;lmmtip;uc=LINK;prob=0.75;pprob=0.75
+ de;uc=MATCH
+ es;lc=Noun
+ nl;lc=Noun
+
+The first line enables _Link Multiple Matchable Tokens in Phrases_ and the
linking of upper case tokens for all languages. In addition it sets the minimum
probabilities for Pos- and Phrase annotations to 0.75 (which is also the
default). The following three lines provide additional language specific
defaults. For German the upper case mode is reset to MATCH, as in German all
Nouns use upper case. For Spanish and Dutch, linking for the LexicalCategory
Noun is enabled. This is because the OpenNLP POS tagger for those languages
does not support ProperNouns and therefore the Engine would not link any tokens
if _Link ProperNouns only_ is enabled. The same configuration in the OSGI
'.config' file syntax looks as follows:
+
+
org.apache.stanbol.enhancer.engines.keywordextraction.processedLanguages=["*;lmmtip;uc\=LINK;prob\=0.75;pprob\=0.75","de;uc\=MATCH","es;lc\=Noun","nl;lc\=Noun"]
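+
+The way defaults and language specific parameters combine can be illustrated
with a small Python sketch. The helper names are hypothetical, and the rule
that language specific values override the '*' / '' defaults is an assumption
of the sketch. Parameters given without a value are kept as empty strings,
since their interpretation depends on the parameter ('lmmtip' means true, 'lc'
means an empty list).
+

```python
def parse_language_line(line):
    """Split '{language};{name}={value};...' into (language, params)."""
    language, *raw_params = line.split(";")
    params = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        # A parameter without '=' is stored with an empty value.
        params[name.strip()] = value.strip()
    return language.strip(), params


def effective_params(lines, lang):
    """Merge the '*' / '' defaults with the settings for one language."""
    defaults, specific = {}, {}
    for line in lines:
        language, params = parse_language_line(line)
        if language in ("", "*"):
            defaults.update(params)
        elif language == lang:
            specific.update(params)
    # Assumption: language specific values override the defaults.
    return {**defaults, **specific}


lines = ["*;lmmtip;uc=LINK;pprob=0.75", "de;uc=MATCH", "es;lc=Noun"]
german = effective_params(lines, "de")
english = effective_params(lines, "en")
spanish = effective_params(lines, "es")
```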
+
+The 2nd example shows how to define default settings without using the
wildcard '*', which would enable processing of all languages. The following
example shows a configuration that only enables English and German and ignores
text in all other languages.
+
+ ;lmmtip;uc=LINK;prob=0.75;pprob=0.75
+ en
+ de;uc=MATCH
+
+
+### Entity Linker Configuration
+
+This configuration allows configuring the linking process with the controlled
vocabulary. This includes searching, matching and the writing of
Enhancements for suggestions. _NOTE_ that all parameters support String
values regardless of the data type, e.g. passing "true" is supported for
booleans, "1.5" for floating point numbers, ...
+
+* __Label Field__ _(enhancer.engines.linking.labelField)_: The name of the
field/property used to link (search and match) Entities. Only a single field is
supported for performance reasons.
+* __Case Sensitivity__ _(enhancer.engines.linking.caseSensitive)_: Boolean
switch that allows activating/deactivating case sensitive matching. It is
important to understand that even with case sensitivity activated an Entity
with a label such as "Anaconda" will be suggested for the mention of
"anaconda" in the text. The main difference will be the confidence value of
such a suggestion, as with case sensitivity activated the starting letters "A"
and "a" are NOT considered to be matching. See the second technical part for
details about the matching process. Case Sensitivity is deactivated by default.
It is recommended to activate it if controlled vocabularies contain
abbreviations similar to commonly used words, e.g. CAN for Canada.
+* __Type Field__ _(enhancer.engines.linking.typeField)_: Values of this field
are used as values of the "fise:entity-types" property of created
"[fise:EntityAnnotation](../enhancementstructure.html#fiseentityannotation)"s.
The default is "rdf:type". _NOTE_ that in contrast to the
[NamedEntityLinking](namedentityextractionengine) the types are not used for
the linking process. They are only used while writing the
'fise:EntityAnnotation's and to determine the 'dc:type' values of
'fise:TextAnnotation's.
+* __Type Mappings__ _(enhancer.engines.linking.typeMappings)_: The FISE
enhancement structure (as used by the Stanbol Enhancer) distinguishes
[TextAnnotation](../enhancementstructure.html#fisetextannotation) and
[EntityAnnotation](../enhancementstructure.html#fiseentityannotation)s. The
EntityLinkingEngine needs to create both types of Annotations:
TextAnnotations selecting the words that match some Entities in the Controlled
Vocabulary and EntityAnnotations that represent an Entity suggested for a
TextAnnotation. The Type Mappings are used to determine the "dc:type" of the
TextAnnotation based on the types of the suggested Entity. The default
configuration comes with mappings for Persons, Organizations, Places and
Concepts, but this field allows defining additional mappings. For details
about the syntax see the sub-section "Type Mappings Syntax" below.
+* __Redirect Field__ _(enhancer.engines.linking.redirectField)_ and __Redirect
Mode__ _(enhancer.engines.linking.redirectMode)_: Redirects allow following
links to other entities defined in the vocabulary linked against. This is
useful in cases where matched Entities are not equal to the Entities that
users want to suggest. A good example is [DBpedia](http://dbpedia.org) where
the Entity 'dbpedia:USA' defines only the label "USA" and a redirect to the
Entity 'dbpedia:United_States' with all the information. The _Redirect Mode_
defines how redirects are handled: "IGNORE" ignores redirects; "ADD_VALUES"
causes information of the redirected entity ('dbpedia:United_States') to be
added to the matched one ('dbpedia:USA'); "FOLLOW" will suggest the redirected
Entity ('dbpedia:United_States') instead of the matched one ('dbpedia:USA').
The _Redirect Field_ defines the field/property used for redirects.
+* __Suggestions__ _(enhancer.engines.linking.suggestions)_: The maximum number
of suggestions. The default value is '3'. If the engine is used in
combination with a post processing engine (e.g. disambiguation) then users
might want to increase this value.
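+
+The three _Redirect Mode_ values can be illustrated with the following Python
sketch. Entities are modelled as plain dictionaries and `apply_redirect` is a
hypothetical helper; neither reflects the actual Stanbol data model, and the
'rdf:type' value in the example is made up for illustration.
+

```python
def apply_redirect(matched, redirected, mode):
    """Return the entity to suggest for the given redirect mode."""
    if mode == "IGNORE" or redirected is None:
        return matched
    if mode == "FOLLOW":
        return redirected
    if mode == "ADD_VALUES":
        # Keep the matched entity, but copy over missing values.
        merged = {
            "id": matched["id"],
            "fields": {k: list(v) for k, v in matched["fields"].items()},
        }
        for field, values in redirected["fields"].items():
            target = merged["fields"].setdefault(field, [])
            for value in values:
                if value not in target:
                    target.append(value)
        return merged
    raise ValueError(f"unknown redirect mode: {mode}")


usa = {"id": "dbpedia:USA", "fields": {"rdfs:label": ["USA"]}}
united_states = {
    "id": "dbpedia:United_States",
    "fields": {
        "rdfs:label": ["United States"],
        "rdf:type": ["dbp-ont:Country"],  # illustrative value only
    },
}

ignored = apply_redirect(usa, united_states, "IGNORE")
followed = apply_redirect(usa, united_states, "FOLLOW")
added = apply_redirect(usa, united_states, "ADD_VALUES")
```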
+
+The following properties define how Linkable and Matchable Tokens are linked
against the Entities of the linked vocabulary
+
+* __Default Matching Language__
_(enhancer.engines.linking.defaultMatchingLanguage)_: Linking is always done in
the language of the processed text and in the _Default Matching Language_. By
default the default matching language covers labels without a language tag, but
this parameter allows overriding this with a specific language. This is e.g.
useful for [DBpedia](http://dbpedia.org) where all labels are marked with the
language of the source Wikipedia data. So it makes sense to configure the
default matching language to this value.
+* __Max Search Token Distance__
_(enhancer.engines.linking.maxSearchTokenDistance)_: The maximum number of
Tokens around a linked token that are searched for additional matchable tokens
to be included in searches for Entities. The default value is '3'. As an
example, in the text section "at the University of Munich a new procedure to"
only "Munich" would be marked as linkable token if _Proper Noun Linking_ is
activated. However, for searching Entities it makes sense to also use the
matchable term 'University', because otherwise a search would potentially
return a huge number of candidate Entities mentioning 'Munich' in their
labels. This parameter allows configuring the maximum distance of tokens so
that the EntityLinkingEngine may include them as additional optional
constraints for queries via the EntitySearcher interface. _NOTE_ that this
parameter will not include tokens outside of a _processable chunk_ if
the _linked token_ is within one.
+* __Max Search Tokens__ _(enhancer.engines.linking.maxSearchTokens)_: The
maximum number of Tokens used for searches via the _EntitySearcher_ interface.
The default value is '2'. In case more _matchable tokens_ are within the
configured _Max Search Token Distance_, those closer to and trailing the
_linkable token_ are preferred. E.g. the text "president Barack Obama" where
'Barack' is the currently active _linkable token_ will result in a query with
the tokens 'Barack' OR 'Obama' if _Max Search Tokens_=2 and _Max Search Token
Distance_>=1, because both 'president' and 'Obama' have a distance of 1 but
trailing Tokens are preferred.
+* __Lemma based Matching__ _(enhancer.engines.linking.lemmaMatching)_: If this
feature is enabled then the _MorphoFeatures#getLemma()_ values are used instead
of the _Token#getSpan()_s if present.
+* __Min Search Token Length__
_(enhancer.engines.linking.minSearchTokenLength)_: This is used as fallback if
the _Tokens_ in the _[AnalyzedText](../nlp/analyzedtext)_ do not contain Part
of Speech annotations or if the confidence of those annotations is too low. The
default value is '3' meaning that in such cases all tokens with more than '3'
characters are linked with the vocabulary. _NOTE_ that this configuration might
move to the _Text Processing Configuration_ in future versions.
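+
+The interplay of _Max Search Tokens_ and _Max Search Token Distance_ in the
"president Barack Obama" example can be sketched as follows. The function
`select_search_tokens` is hypothetical; it follows the example in counting the
linkable token towards _Max Search Tokens_ and ignores the restriction to
processable phrases.
+

```python
def select_search_tokens(tokens, matchable, index, max_tokens=2, max_distance=3):
    """Pick the linkable token at `index` plus nearby matchable tokens.

    `matchable` holds one boolean per token. Closer tokens are preferred;
    for equal distance, trailing tokens come before leading ones.
    """
    candidates = []
    for i, is_matchable in enumerate(matchable):
        distance = abs(i - index)
        if i != index and is_matchable and distance <= max_distance:
            # Sort key: distance first, then trailing (False) before leading.
            candidates.append((distance, i < index, i))
    candidates.sort()
    chosen = [i for _, _, i in candidates[: max_tokens - 1]]
    return [tokens[index]] + [tokens[i] for i in chosen]


tokens = ["president", "Barack", "Obama"]
matchable = [True, True, True]
two = select_search_tokens(tokens, matchable, index=1, max_tokens=2)
three = select_search_tokens(tokens, matchable, index=1, max_tokens=3)
```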
+
+The parameters below are used to configure the matching process.
+
+* __Minimum Token Match Score__ _(enhancer.engines.linking.minTokenScore)_:
This defines how well single tokens of the text need to match single tokens in
the label so that they are considered as matching. This parameter configures
the lower limit. However, the actual token match score also influences the
overall matching scores for labels with the text. So non exact matches will
decrease matching scores for the whole label with the text.
+* __Min Label Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minLabelMatchFactor_
[0..1]::double: The "Label Score" [0..1] represents how much of the Label of an
Entity matches with the Text. It compares the number of Tokens of the Label
with the number of Tokens matched to the Text. Non exact matches for Tokens, or
Tokens within the label appearing in a different order than in the text, also
reduce this score. Entities are only considered if at least one of their
labels scores higher than the minimum for all three of _Min Label Match Score_,
_Min Text Match Score_ and _Min Match Score_.
+* __Min Matched Tokens__
_org.apache.stanbol.enhancer.engines.keywordextraction.minFoundTokens_
[1..*]::int: The minimum number of matching tokens. Only "matchable" tokens are
counted. For full matches (where all tokens of the Label match tokens in the
text) this parameter is ignored.
+
+ This parameter is strongly related to the _Min Label Match Score_.
Typical settings are
+
+ 1. _Min Matched Tokens_=1 and _Min Label Match Score_ > 0.5 (e.g. 0.75)
+ 2. _Min Matched Tokens_=2 and _Min Label Match Score_ <= 0.5 (e.g. 0.5)
+
+ For Labels consisting of one or two words both options have the same
result, but for longer labels (1) is more restrictive than (2). The important
thing is that both options ensure that Labels with more than one token will
not be considered if only a single token matches the text.
+
+ If used in combination with a disambiguation Engine one might want to
consider suggesting Entities where only a single token of a multi-token label
matches. In such cases a configuration like _Min Matched Tokens_=1 and _Min
Label Match Score_ <= 0.5 (e.g. 0.4) might be considered. In such scenarios
users will also want to considerably increase the value for _Suggestions_
(typically values > 10).
+* __Min Text Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_
[0..1]::double: The "Text Score" [0..1] represents how well the Label of an
Entity matches the selected Span in the Text. It compares the number of
matched Tokens from the label with the number of Tokens enclosed by the
Span in the Text an Entity is suggested for. Non exact matches for Tokens, or
Tokens within the label appearing in a different order than in the text, also
reduce this score. Entities are only considered if at least one of their
labels scores higher than the minimum for all three of _Min Label Match Score_,
_Min Text Match Score_ and _Min Match Score_.
+* __Min Match Score__
_org.apache.stanbol.enhancer.engines.keywordextraction.minTextMatchFactor_
[0..1]::double: Defined as the product of the "Text Score" with the "Label
Score", meaning that this value represents both how well the label matches the
text and how much of the label is matched with the text. Entities are only
considered if at least one of their labels scores higher than the minimum for
all three of _Min Label Match Score_, _Min Text Match Score_ and _Min Match
Score_.
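+
+The relation between the three scores and _Min Matched Tokens_ can be made
concrete with a simplified Python sketch. It only counts tokens: the
reductions for inexact token matches and for differing token order are left
out, the default thresholds are placeholders rather than the engine defaults,
and treating a score equal to the minimum as sufficient is an assumption.
+

```python
def match_scores(matched, label_tokens, span_tokens):
    """Return (label score, text score, match score) from token counts."""
    label_score = matched / label_tokens
    text_score = matched / span_tokens
    return label_score, text_score, label_score * text_score


def accept(matched, label_tokens, span_tokens,
           min_matched=2, min_label=0.5, min_text=0.5, min_match=0.25):
    """Decide whether a label match is good enough to be suggested."""
    label_score, text_score, score = match_scores(
        matched, label_tokens, span_tokens)
    full_match = matched == label_tokens
    # Min Matched Tokens is ignored for full matches.
    if not full_match and matched < min_matched:
        return False
    return (label_score >= min_label
            and text_score >= min_text
            and score >= min_match)


# "Barack Obama" (2 label tokens) against the single text token "Obama"
option_1 = accept(1, 2, 1, min_matched=1, min_label=0.75)
option_2 = accept(1, 2, 1, min_matched=2, min_label=0.5)
# relaxed setting for use with a disambiguation engine
relaxed = accept(1, 2, 1, min_matched=1, min_label=0.4)
```
+
+Both typical settings reject the single-token match, while the relaxed
setting accepts it. For a three-token label with two matched tokens, setting
(1) still rejects and setting (2) accepts, which is the sense in which (1) is
more restrictive for longer labels.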
+
+#### Type Mappings Syntax
+
+The Type Mappings are used to determine the "dc:type" of the
[TextAnnotation](../enhancementstructure.html#fisetextannotation) based on the
types of the suggested Entity. The field "Type Mappings" (property:
_org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings_) can be
used to customize such mappings.
+
+This field uses the following syntax
+
+ {uri}
+ {source} > {target}
+ {source1}; {source2}; ... {sourceN} > {target}
+
+The first variant is a shorthand for {uri} > {uri} and therefore specifies
that the {uri} should be used as 'dc:type' for
[TextAnnotation](../enhancementstructure.html#fisetextannotation)s if the
matched entity is of type {uri}. The second variant matches a {source} URI to a
{target}. Variant three shows the possibility to match multiple URIs to the
same target in a single configuration line.
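+
+The three syntax variants can be parsed as in the following Python sketch. The
function `parse_type_mappings` is hypothetical and does not expand namespace
prefixes or validate URIs.
+

```python
def parse_type_mappings(lines):
    """Parse type mapping lines into a {source: target} dictionary."""
    mappings = {}
    for line in lines:
        sources, separator, target = line.partition(">")
        source_uris = [s.strip() for s in sources.split(";") if s.strip()]
        if not source_uris:
            continue  # skip empty lines
        # '{uri}' alone is a shorthand for '{uri} > {uri}'.
        target_uri = target.strip() if separator else source_uris[0]
        for source in source_uris:
            mappings[source] = target_uri
    return mappings


mappings = parse_type_mappings([
    "skos:Concept",
    "dailymed:organization > dbp-ont:Organisation",
    "dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person",
])
```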
+
+Both 'ns:localName' and fully qualified URIs are supported. For supported
namespaces see the
[NamespaceEnum](http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/generic/servicesapi/src/main/java/org/apache/stanbol/entityhub/servicesapi/defaults/NamespaceEnum.java).
Information about accepted (INFO) and ignored (WARN) type mappings is
available in the logs.
+
+Some Examples of additional Mappings for the e-health domain:
+
+ drugbank:drugs; dbp-ont:Drug; dailymed:drugs; sider:drugs; tcm:Medicine >
drugbank:drugs
+ diseasome:diseases; linkedct:condition; tcm:Disease > diseasome:diseases
+ sider:side_effects
+ dailymed:ingredients
+ dailymed:organization > dbp-ont:Organisation
+
+The first two lines map some well known Classes that represent drugs and
diseases to 'drugbank:drugs' and 'diseasome:diseases'. The third and fourth
lines define 1:1 mappings for side effects and ingredients, and the last line
adds 'dailymed:organization' as an additional mapping to DBpedia Ontology
Organisation.
+
+The following mappings are predefined by the EntityLinkingEngine.
+
+ dbp-ont:Person; foaf:Person; schema:Person > dbp-ont:Person
+ dbp-ont:Organisation; dbp-ont:Newspaper; schema:Organization >
dbp-ont:Organisation
+ dbp-ont:Place; schema:Place; gml:_Feature > dbp-ont:Place
+ skos:Concept
+
+## Extension Points
+
+This section describes Interfaces that are used as Extension Points by the
KeywordLinkingEngine
+
+### EntitySearcher
+
+The EntitySearcher interface is used by the EntityLinkingEngine to search for
Entities in the linked Vocabulary. This interface supports two main
functionalities:
+
+__Dereference Entities__ _get(String id,Set<String>
includeFields)::Representation_
+
+This method is called with the 'id' of an Entity and needs to return the data
of the Entity as _Representation_. The returned _Representation_ needs to at
least include the passed 'includeFields'. If 'includeFields' is empty or NULL
then all information for the Entity should be included in the returned
_Representation_.
+
+__Entity Search__ _lookup(String field, Set<String> includeFields,
List<String> search, String[] languages, Integer
limit)::Collection<Representation>_
+
+This method is used for searching entities in the controlled vocabulary. The
configured _Label Field_ is passed in the 'field' parameter. The
'includeFields' contain all fields required for the linking process.
_Representation_s returned as results need to include values for those fields.
The 'search' parameter includes the tokens used for the search. Values should
be considered optional, however results are expected to rank Entities that
match more search entries first.
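+
+The contract of the two methods can be illustrated with a small in-memory
stand-in written in Python. This is not the Java interface: the class
`InMemoryEntitySearcher`, its data layout and the word based matching are
assumptions made for the sketch only.
+

```python
class InMemoryEntitySearcher:
    """Toy searcher over {id: {field: [(value, language), ...]}}."""

    def __init__(self, entities):
        self.entities = entities

    def get(self, entity_id, include_fields=None):
        """Dereference an entity; no include_fields means all fields."""
        entity = self.entities.get(entity_id)
        if entity is None:
            return None
        if not include_fields:
            return dict(entity)
        return {f: v for f, v in entity.items() if f in include_fields}

    def lookup(self, field, include_fields, search, languages, limit=None):
        """Return (id, representation) pairs, best matches first."""
        wanted = {token.lower() for token in search}
        scored = []
        for entity_id, entity in self.entities.items():
            words = set()
            for value, language in entity.get(field, []):
                if language in languages:
                    words.update(value.lower().split())
            hits = len(wanted & words)
            if hits:  # search tokens are optional, one hit is enough
                scored.append((-hits, entity_id))
        scored.sort()  # more matched search tokens first
        fields = set(include_fields) | {field}
        return [(i, self.get(i, fields)) for _, i in scored[:limit]]


entities = {
    "ex:UniSalzburg": {
        "rdfs:label": [("University of Salzburg", "en")],
        "rdf:type": [("ex:University", None)],
    },
    "ex:Salzburg": {"rdfs:label": [("Salzburg", "en"), ("Salzburg", "de")]},
    "ex:UniMunich": {"rdfs:label": [("University of Munich", "en")]},
}
searcher = InMemoryEntitySearcher(entities)
results = searcher.lookup(
    "rdfs:label", {"rdf:type"}, ["Salzburg", "University"], ["en"], limit=2)
result_ids = [entity_id for entity_id, _ in results]
```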
+
+
+Currently the Stanbol Entityhub based implementations are instantiated based
on the value of the
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_
property. Users that want to use a different implementation of this interface
for linking will need to extend the KeywordLinkingEngine and override the
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object>
configuration) and #deactivateEntitySearcher() methods. Those methods are
called during activation/deactivation of the KeywordLinkingEngine and are
expected to set/unset the #entitySearcher field.
+
+### LabelTokenizer
+
+The LabelTokenizer interface is used to tokenize labels of Entities from the
linked Vocabulary. As the matching process of the KeywordLinkingEngine is based
on Tokens (words), multi-word labels (e.g. University of Munich) need to be
tokenized before they can be matched against the current context in the Text.
+
+LabelTokenizers are OSGI services. Their configuration can optionally define
the _'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property.
Values are considered to be language configurations. Configurations can
explicitly include/exclude languages. A wildcard is also supported (e.g.
"en,de" would include English and German; "!it,!fr,*" would specify all
languages except Italian and French). If no configuration is provided then "*"
(all languages) is assumed.
+
+The KeywordLinkingEngine will - by default - always use the LabelTokenizer
with the highest "service.ranking" for a given language to tokenize labels. By
default it comes with an OpenNLP based Tokenizer implementation that registers
itself for all languages with a "service.ranking" of "-1000".
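+
+The selection rule can be sketched in Python as follows. The dictionaries
stand in for registered OSGI services and the names `supports` and
`select_tokenizer` are hypothetical; the sketch only illustrates the "highest
service.ranking among the tokenizers supporting the language" rule.
+

```python
def supports(language_config, lang):
    """Check a config such as 'en,de' or '!it,!fr,*' for one language."""
    if not language_config:
        return True  # no configuration means "*" (all languages)
    entries = [entry.strip() for entry in language_config.split(",")]
    if "!" + lang in entries:
        return False
    return lang in entries or "*" in entries


def select_tokenizer(tokenizers, lang):
    """Return the supporting tokenizer with the highest ranking."""
    candidates = [t for t in tokenizers if supports(t["languages"], lang)]
    return max(candidates, key=lambda t: t["ranking"], default=None)


tokenizers = [
    {"name": "opennlp-default", "languages": None, "ranking": -1000},
    {"name": "custom-ja", "languages": "ja", "ranking": 0},
]
for_japanese = select_tokenizer(tokenizers, "ja")["name"]
for_english = select_tokenizer(tokenizers, "en")["name"]
```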
+
+Users that want to use a different Tokenizer need to register an
implementation for the given language(s) with a higher "service.ranking".
Users that want to provide their own LabelTokenizer and ignore the values
provided by OSGI need to extend the KeywordLinkingEngine, set the
#labelTokenizer field themselves AND override the
#bindLabelTokenizer(LabelTokenizerManager ltm) and
#unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do
NOT change the #labelTokenizer field.