Author: rwesten
Date: Fri Nov 23 12:44:40 2012
New Revision: 1412857
URL: http://svn.apache.org/viewvc?rev=1412857&view=rev
Log:
initial documentation for STANBOL-733
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
(with props)
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext?rev=1412857&view=auto
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
(added)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
Fri Nov 23 12:44:40 2012
@@ -0,0 +1,26 @@
+Title: The Entityhub Linking Engine: Linking NLP processed Text with
Vocabularies managed by the Stanbol Entityhub
+
+The EntityhubLinkingEngine is the successor of the
[KeywordLinkingEngine](keywordlinkingengine). It is based on the
[EntityLinkingEngine](entitylinkingengine) configured with an
[EntitySearcher](entitylinkingengine#entitysearcher) that can link Entities
managed by the Entityhub, by ReferencedSites, or by ManagedSites. The
EntityhubLinkingEngine does not implement the [EnhancementEngine](index)
interface itself. It only configures an instance of the
[EntityLinkingEngine](entitylinkingengine).
+
+For detailed documentation of the linking process, see the
documentation of the [EntityLinkingEngine](entitylinkingengine). This document
focuses only on the configuration and usage of this engine.
+
+
+## Configuration
+
+The configuration of the EntityhubLinkingEngine supports the following
options. First, it lets you configure the two properties common to all
enhancement engines:
+
+* __Name__ _(stanbol.enhancer.engine.name)_: The name of the Enhancement
Engine. This name is used to refer to an [EnhancementEngine](index.html) in
[EnhancementChain](enhancementchain.html)s.
+* __ServiceRanking__ _(service.ranking)_: If multiple enhancement
engines use the same name, only the one with the highest ranking is
used.
+
+Next, it lets you configure the Entityhub Site to use:
+
+* __Referenced Site__ _(enhancer.engines.linking.entityhub.siteId)_: The name
of the ReferencedSite of the Stanbol Entityhub that holds the controlled
vocabulary to be used for extracting Entities. "entityhub" or "local" can be
used to extract Entities managed directly by the Entityhub.
+
+Finally it supports all configuration options supported by the
[EntityLinkingEngine](entitylinkingengine).
+
+* [Text Processing
Configuration](entitylinking#text-processing-configuration): This defines what
languages are enabled and is also used to configure how NLP processing results
are used by the Engine
+* [Entity Linking Configuration](entitylinking#entity-linker-configuration):
This defines how entities are searched in the vocabulary and how search results
are matched with the text. It also allows configuring the 'dc:type's for created
'fise:TextAnnotation's and whether entity information is included in the
enhancement results.
+
+The following screenshot shows the configuration dialog of the
EntityhubLinkingEngine as presented by the Apache Felix Webconsole. Note that
this dialog only provides a limited set of configuration options. Other
supported configuration options can only be set directly in OSGi "*.config"
files.
+
+
\ No newline at end of file
Added:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png?rev=1412857&view=auto
==============================================================================
Binary file - no diff available.
Propchange:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1412857&r1=1412856&r2=1412857&view=diff
==============================================================================
---
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
(original)
+++
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
Fri Nov 23 12:44:40 2012
@@ -6,10 +6,10 @@ For doing so it uses the following confi
* __Text Processing Configuration__: This configures how the
EntityLinkingEngine consumes NLP processing results. Such configurations can be
language specific.
* __Entity Linking Configuration__: This configures various properties that
are used for the linking process with the vocabulary
-* __EntitySearcher__: This interface is used to search and dereference
Entities. It needs to be implemented to use a datasource for linking with the
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub
(see [EntityhubLinkingEngine](entityhublinkingengine))
+* __EntitySearcher__: This interface is used to search and dereference
Entities. It needs to be implemented to use a datasource for linking with the
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub
(see [EntityhubLinkingEngine](entityhublinking))
* __LabelTokenizer__: While processed text is already tokenized, the Entity
labels are not. To match labels with the text, the
EntityLinkingEngine therefore needs to tokenize those labels. Apache Stanbol
provides a default implementation of this interface based on the
[OpenNLP](http://opennlp.apache.org) tokenizer API.
-The EntityLinkingEngine can not directly be used as the four things listed
above need to be parsed in its constructor. It is instead intended to be
configured/extended by other components. The
[EntityhubLinkingEngine](entityhublinkingengine) is one of them configuring the
EntityLinkingEngine with EntitySearcher for the Stanbol Entityhub.
+The EntityLinkingEngine cannot be used directly, as the four things listed
above need to be passed to its constructor. It is instead intended to be
configured/extended by other components. The
[EntityhubLinkingEngine](entityhublinking) is one of them; it configures the
EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.
This documentation first describes the implemented entity linking process, then
provides information about the supported configuration parameters of the _Text
Processing Configuration_ and the _Entity Linking Configuration_. The last part
describes how to extend the EntityLinking engine by implementing/providing
custom _EntitySearcher_ and _LabelTokenizer_.
@@ -117,7 +117,7 @@ For the configuration of the processed l
de
en
-
+
This would configure the Engine to only process German and English texts. It
is also possible to explicitly exclude languages
!fr
@@ -253,25 +253,66 @@ This section describes Interfaces that a
### EntitySearcher
-The EntitySearch Interface is used by the EntityLinkingEngine to search for
Entities in the linked Vocabulary. This interface supports two main
functionalities:
+The EntitySearcher interface is used by the EntityLinkingEngine to search for
Entities in the linked Vocabulary. An EntitySearcher instance is passed to the
constructor of the EntityLinkingEngine.
+
+This interface supports two main functionalities, search and dereference,
and also provides some additional metadata. The following list gives a short
overview of its methods.
-__Dereference Entities__ _get(String id,Set<String>
includeFields)::Representation_
+* __Dereference Entities__ _get(String id,Set<String>
includeFields)::Representation_
This method is called with the 'id' of an Entity and needs to return the data
of the Entity as _Representation_. The returned _Representation_ needs to at
least include the passed 'includeFields'. If 'includeFields' is empty or NULL,
then all information for the Entity should be included in the returned
_Representation_.
-__Entity Search__ __lookup(String field, Set<String> includeFields,
List<String> search, String[] languages,Integer
limit)::Collection<Representation>
+* __Entity Search__ _lookup(String field, Set<String> includeFields,
List<String> search, String[] languages,Integer
limit)::Collection<Representation>_
-This method is used for searching entities in the controlled vocabulary. The
configured _Label Field_ is parsed in the 'field' parameter. The
'includedFileds' contain all fields required for the linking process.
_Representation_s returned as result need to include values for those fields.
The 'search' parameter includes the tokens used for the search. Values should
be considered optional however Results are considered to rank Entities that
match more search entires first.
+This method is used for searching entities in the controlled vocabulary. The
configured _Label Field_ is passed in the 'field' parameter. The
'includeFields' contain all fields required for the linking process.
_Representation_s returned as results need to include values for those fields.
The 'search' parameter includes the tokens used for the search. Individual
values should be treated as optional; however, results are expected to rank
Entities that match more search tokens first. The array of 'languages' is used
to pass the languages that need to be considered for the search. If 'languages'
contains NULL or '', labels without a language tag also need to be
included in the search (NOTE that this DOES NOT mean to include labels of any
language!). Finally, the 'limit' parameter specifies the maximum number
of results. If NULL, the implementation can choose a meaningful default.
+* __Offline Mode__ _supportsOfflineMode()::boolean_ : Indicates whether the
EntitySearcher implementation needs to connect to a remote service. This is
needed to deactivate the EntityLinkingEngine in cases where Apache Stanbol is
started in OfflineMode.
+* __Search Result Limit__ _getLimit()::Integer_ : The maximum number of search
results supported by the EntitySearcher implementation. Can return NULL if not
applicable or unknown.
+* __Origin Information__
_getOriginInformation()::Map<UriRef,Collection<Resource>>_ : This
method allows returning information about the origin that is added to every
'fise:EntityAnnotation' created by the EntityLinkingEngine. This is e.g. used
by the Entityhub-based implementation to provide the 'id' of the Entityhub Site
from which the Entities were retrieved.
+
+The [EntityhubLinkingEngine](entityhublinking) includes EntitySearcher
implementations based on the FieldQuery search interface implemented by the
Stanbol Entityhub.
Currently the StanbolEntityhub based implementations are instantiated based on
the value of the
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_.
Users that want to use a different implementation of this Interface to be used
for linking will need to extend the KeywordLinkingEngine and override the
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object>
configuration) and #deactivateEntitySearcher(). Those methods are called during
activation/deactivation of the KeywordLinkingEngine and are expected to
set/unset the #entitySearcher field.
### LabelTokenizer
-The LabelTokenizer interface is used to tokenize labels of Entities from the
linked Vocabulary. As the matching process of the KeywordLinkingEngine is based
on Tokens (words) multi-word labels (e.g. Univerity of Munich) need to be
tokenized before they can be matched against the current context in the Text.
+The LabelTokenizer interface is used to tokenize labels of Entity suggestions
as returned by the [EntitySearcher](#entitysearcher). As the matching process of
the KeywordLinkingEngine is based on Tokens (words), multi-word labels (e.g.
University of Munich) need to be tokenized before they can be matched against
the current context in the Text.
+
+The _LabelTokenizer_ interface defines only the single _tokenize(String label,
String language)::String[]_ method, which takes the label and the language as
parameters and returns the tokens as a String array. If the tokenizer is not
able to tokenize the label (e.g. because it does not support the language) it
MUST return NULL. In this case the NamedEntityLinking engine will try to match
the label as a single token.
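+
+The NULL contract described above can be illustrated with a minimal sketch. The
interface below is a local stand-in that only mirrors the method signature given
in this document, and `WhitespaceEnglishTokenizer` is a hypothetical example
class, not part of Apache Stanbol.
+
```java
// Local stand-in mirroring the signature described in the text (assumption:
// the real interface lives in a Stanbol package that is not shown here).
interface LabelTokenizer {
    /** Returns the tokens of the label, or null if it cannot be tokenized. */
    String[] tokenize(String label, String language);
}

/** Hypothetical tokenizer that only supports English and splits on whitespace. */
class WhitespaceEnglishTokenizer implements LabelTokenizer {
    @Override
    public String[] tokenize(String label, String language) {
        if (label == null || !"en".equals(language)) {
            // Unsupported input: signal this with null so that the caller
            // can fall back to matching the label as a single token.
            return null;
        }
        String trimmed = label.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```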
+
+#### MainLabelTokenizer
+
+As users will very likely want to use multiple
LabelTokenizers for different languages, the EntityLinkingEngine comes with a
MainLabelTokenizer implementation. It registers itself as LabelTokenizer with
the highest possible OSGi 'service.ranking' and tracks all other registered
_LabelTokenizer_s.
+
+So if custom _LabelTokenizer_s register themselves as OSGi services, the
MainLabelTokenizer can forward requests to them. It will do so in the order of
their '<code>service.ranking</code>'s. In addition, _LabelTokenizer_s can use the
'<code>enhancer.engines.keywordextraction.labeltokenizer.languages</code>'
property to formally specify the languages they support. This property
uses the language configuration syntax (e.g. "en,de" would include English
and German; "!it,!fr,*" would specify all languages except Italian and French).
If no configuration is provided, "*" (all languages) is assumed, which is
fine as a default as long as _LabelTokenizer_s correctly return NULL for languages
they do not support.
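+
+How such a language configuration could be evaluated is sketched below. This is
an illustration of the syntax described above under simple assumptions
(comma-separated entries, '!' for exclusion, '*' as wildcard). `LanguageConfig`
and `supports` are hypothetical names, not the Stanbol implementation.
+
```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper that evaluates a language configuration string. */
class LanguageConfig {
    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private boolean wildcard;

    LanguageConfig(String config) {
        if (config == null || config.trim().isEmpty()) {
            wildcard = true; // no configuration: "*" (all languages) is assumed
            return;
        }
        for (String entry : config.split(",")) {
            String lang = entry.trim();
            if (lang.isEmpty()) {
                continue; // ignore empty entries such as in "en,,de"
            }
            if (lang.equals("*")) {
                wildcard = true;
            } else if (lang.startsWith("!")) {
                excluded.add(lang.substring(1)); // explicit exclusion
            } else {
                included.add(lang); // explicit inclusion
            }
        }
    }

    /** Explicit exclusions win; otherwise the wildcard or an inclusion is needed. */
    boolean supports(String language) {
        if (excluded.contains(language)) {
            return false;
        }
        return wildcard || included.contains(language);
    }
}
```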
+
+The MainLabelTokenizer forwards tokenize requests to all available
LabelTokenizer implementations that support a specific language sorted by their
'<code>service.ranking</code>' until the first one does NOT return NULL. If no
LabelTokenizer was found or all returned NULL it will also return NULL.
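+
+The forwarding behaviour described above can be sketched as follows. This is a
simplified illustration, not the MainLabelTokenizer source: rankings are passed
explicitly instead of being read from the OSGi service registry, the language
filter is omitted, and `DelegatingTokenizerSketch` and `register` are
hypothetical names.
+
```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of "first non-null result in ranking order wins". */
class DelegatingTokenizerSketch {

    /** Local stand-in mirroring the tokenize signature described in the text. */
    interface LabelTokenizer {
        String[] tokenize(String label, String language);
    }

    /** Pairs a tokenizer with the ranking it was registered with. */
    private static final class Ranked {
        final int ranking;
        final LabelTokenizer tokenizer;

        Ranked(int ranking, LabelTokenizer tokenizer) {
            this.ranking = ranking;
            this.tokenizer = tokenizer;
        }
    }

    private final List<Ranked> delegates = new ArrayList<>();

    /** Registers a tokenizer and keeps the list sorted, highest ranking first. */
    void register(int ranking, LabelTokenizer tokenizer) {
        delegates.add(new Ranked(ranking, tokenizer));
        delegates.sort(Comparator.comparingInt((Ranked r) -> r.ranking).reversed());
    }

    /** Asks each delegate in ranking order until one does not return null. */
    String[] tokenize(String label, String language) {
        for (Ranked delegate : delegates) {
            String[] tokens = delegate.tokenizer.tokenize(label, language);
            if (tokens != null) {
                return tokens;
            }
        }
        return null; // no delegate registered, or all of them returned null
    }
}
```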
+
+The following code snippet shows how to use the _MainLabelTokenizer_ as
_LabelTokenizer_ for the _EntityLinkingEngine_
+
+    :::java
+    @Reference
+    LabelTokenizer labelTokenizer;
+
+This will inject the MainLabelTokenizer as it uses
<code>Integer.MAX_VALUE</code> as <code>service.ranking</code>.
+
+    :::java
+    @Activate
+    protected void activate(ComponentContext ctx){
+        //within the activate method it can then be used
+        //to initialize the NamedEntityLinkingEngine
+        NamedEntityLinkingEngine engine = new NamedEntityLinkingEngine(
+            engineName,
+            entitySearcher, //the searcher might not be available
+            textProcessingConfig, linkerConfig, //config
+            labelTokenizer); //the MainLabelTokenizer
+    }
+
+Configuring the NamedEntityLinkingEngine like this ensures that all registered
_LabelTokenizer_s are considered for tokenizing.
+
+#### OpenNLP LabelTokenizer
+
+This is the default implementation of a LabelTokenizer based on the
[OpenNLP](http://opennlp.apache.org) tokenizer API. Internally it uses the
OpenNLP service to load tokenizer models for languages. If no language-specific
model is available, it uses the OpenNLP SimpleTokenizer implementation. The
_OpenNlpLabelTokenizer_ registers itself with a '<code>service.ranking</code>'
of '-1000' so it will b
-LabelTokenizer are OSGI services. Their configuration optionally can define
the _'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property.
Values are considered to be language configurations. Configurations can
explicitly include/exclude languages. Also a wildcard is supported (e.g.
"en,de" would include English and German; "!it,!fr,*" would specify all
languages expect Italian and French. If no configuration is provided than "*"
(all languages) is assumed.
+The _LabelTokenizerManager_ interface extends the _
The KeywordLinkingEngine will - by default - always use the LabelTokenizer
with the highest "service.ranking" for a given language to tokenize labels. By
default it comes with an OpenNLP based Tokenizer implementation that registers
itself for all languages with a "service.ranking" of "-1000".
-Users that want to use a different Tokenizer need to register an
implementation for the given language(s) with an higher "service.ranking".
Users that want to provide there own LabelTokenizer and ignore the values
provided by OSGI need to extend the KeywordLinkingEngine set the
#labelTokenizer field themself AND override the
#bindLabelTokenizer(LabelTokenizerManager ltm) and
#unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do
NOT change the #labelTokenizer field.