engines: entityhublinking.mdtext entityhublinking.png entitylinking.mdtext

rwesten Fri, 23 Nov 2012 04:45:11 -0800

Author: rwesten
Date: Fri Nov 23 12:44:40 2012
New Revision: 1412857

URL: http://svn.apache.org/viewvc?rev=1412857&view=rev
Log:
initial documentation for STANBOL-733


Added:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
   (with props)
Modified:
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext?rev=1412857&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.mdtext
 Fri Nov 23 12:44:40 2012
@@ -0,0 +1,26 @@
+Title: The Entityhub Linking Engine: Linking NLP processed Text with 
Vocabularies managed by the Stanbol Entityhub
+
+The EntityhubLinkingEngine is the successor of the 
[KeywordLinkingEngine](keywordlinkingengine). It is based on the 
[EntityLinkingEngine](entitylinkingengine) configured with an 
[EntitySearcher](entitylinkingengine#entitysearcher) that can link Entities 
managed by the Entityhub, by ReferencedSites, or by ManagedSites. The 
EntityhubLinkingEngine does not implement the [EnhancementEngine](index) 
interface itself. It only configures an instance of the 
[EntityLinkingEngine](entitylinkingengine).
+
+For detailed documentation of the linking process, please see the 
documentation of the [EntityLinkingEngine](entitylinkingengine). This document 
focuses only on the configuration and usage of this Engine.
+
+
+## Configuration
+
+The configuration of the EntityhubLinkingEngine supports the following 
options. First, it allows configuring the two properties common to all 
enhancement engines:
+
+* __Name__ _(stanbol.enhancer.engine.name)_: The name of the Enhancement 
Engine. This name is used to refer to an [EnhancementEngine](index.html) in 
[EnhancementChain](enhancementchain.html)s
+* __ServiceRanking__ _(service.ranking)_: If multiple enhancement 
engines use the same name, then only the one with the highest ranking will 
be used.
+
+Next, it allows configuring the Entityhub Site to use:
+
+* __Referenced Site__ _(enhancer.engines.linking.entityhub.siteId)_: The name 
of the ReferencedSite of the Stanbol Entityhub that holds the controlled 
vocabulary to be used for extracting Entities. "entityhub" or "local" can be 
used to extract Entities managed directly by the Entityhub.
+
+Finally, it supports all configuration options of the 
[EntityLinkingEngine](entitylinkingengine):
+
+* [Text Processing 
Configuration](entitylinking#text-processing-configuration): This defines what 
languages are enabled and is also used to configure how NLP processing results 
are used by the Engine
+* [Entity Linking Configuration](entitylinking#entity-linker-configuration): 
This defines how entities are searched in the vocabulary and how search results 
are matched with the text. It also allows configuring the 'dc:type's of created 
'fise:TextAnnotation's and whether entity information is included in the 
enhancement results.
+
+The following screenshot shows the configuration dialog of the 
EntityhubLinkingEngine as shown when using the Apache Felix Webconsole for its 
configuration. Note, however, that this dialog provides only a limited set of 
configuration options. Other supported configuration options can only be set 
directly in OSGI "*.config" files.
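
As a rough illustration of such a "*.config" file, the fragment below sets the three properties named above using the Apache Felix configuration file syntax. The engine name `dbpedia-linking` and the site id `dbpedia` are made-up example values, and the file name (which must match the PID of the engine component) is not shown because it is not given in this document.

```
# hypothetical example values; only the property keys come from this document
stanbol.enhancer.engine.name="dbpedia-linking"
service.ranking=I"0"
enhancer.engines.linking.entityhub.siteId="dbpedia"
```

The `I"..."` prefix marks the value as an Integer in the Felix syntax; plain quoted values are Strings.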
+
+![Configuration dialog for the 
EntityhubLinkingEngine](entityhublinking.png)
\ No newline at end of file

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png?rev=1412857&view=auto
==============================================================================
Binary file - no diff available.

Propchange: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entityhublinking.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext?rev=1412857&r1=1412856&r2=1412857&view=diff
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 (original)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.mdtext
 Fri Nov 23 12:44:40 2012
@@ -6,10 +6,10 @@ For doing so it uses the following confi
 
 * __Text Processing Configuration__: This configures how the 
EntityLinkingEngine consumes NLP processing results. Such configurations can be 
language specific.
 * __Entity Linking Configuration__: This configures various properties that 
are used for the linking process with the vocabulary
-* __EntitySearcher__: This interface is used to search and dereference 
Entities. It needs to be implemented to use a datasource for linking with the 
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub 
(see [EntityhubLinkingEngine](entityhublinkingengine))
+* __EntitySearcher__: This interface is used to search and dereference 
Entities. It needs to be implemented to use a datasource for linking with the 
EntityLinkingEngine. Stanbol provides implementations for the Stanbol Entityhub 
(see [EntityhubLinkingEngine](entityhublinking))
 * __LabelTokenizer__: While the processed text is already tokenized, the Entity 
labels are not. To match labels with the text, the 
EntityLinkingEngine therefore needs to tokenize those labels. Apache Stanbol 
provides a default implementation of this interface based on the 
[OpenNLP](http://opennlp.apache.org) tokenizer API.
 
-The EntityLinkingEngine can not directly be used as the four things listed 
above need to be parsed in its constructor. It is instead intended to be 
configured/extended by other components. The 
[EntityhubLinkingEngine](entityhublinkingengine) is one of them configuring the 
EntityLinkingEngine with EntitySearcher for the Stanbol Entityhub.
+The EntityLinkingEngine cannot be used directly, as the four things listed 
above need to be passed to its constructor. It is instead intended to be 
configured/extended by other components. The 
[EntityhubLinkingEngine](entityhublinking) is one of them, configuring the 
EntityLinkingEngine with an EntitySearcher for the Stanbol Entityhub.
 
 This documentation first describes the implemented entity linking process, then 
provides information about the supported configuration parameters of the _Text 
Processing Configuration_ and the _Entity Linking Configuration_. The last part 
describes how to extend the EntityLinkingEngine by implementing/providing 
custom _EntitySearcher_ and _LabelTokenizer_.
 
@@ -117,7 +117,7 @@ For the configuration of the processed l
 
     de
     en
-    
+
 This would configure the Engine to only process German and English texts. It 
is also possible to explicitly exclude languages
 
     !fr
@@ -253,25 +253,66 @@ This section describes Interfaces that a
 
 ### EntitySearcher
 
-The EntitySearch Interface is used by the EntityLinkingEngine to search for 
Entities in the linked Vocabulary. This interface supports two main 
functionalities:
+The EntitySearcher interface is used by the EntityLinkingEngine to search for 
Entities in the linked Vocabulary. An EntitySearcher instance is passed to the 
constructor of the EntityLinkingEngine.
+
+This interface supports two main functionalities, search and dereference, 
and also provides some additional metadata. The following list gives a short 
overview of its methods.
 
-__Dereference Entities__ _get(String id,Set<String> 
includeFields)::Representation_
+* __Dereference Entities__ _get(String id,Set&lt;String&gt; 
includeFields)::Representation_
 
 This method is called with the 'id' of an Entity and needs to return the data 
of the Entity as _Representation_. The returned _Representation_ needs to at 
least include the passed 'includeFields'. If 'includeFields' is empty or NULL, 
then all information for the Entity should be included in the returned 
_Representation_.
 
-__Entity Search__ __lookup(String field, Set<String> includeFields, 
List<String> search, String[] languages,Integer 
limit)::Collection<Representation>
+* __Entity Search__ _lookup(String field, Set&lt;String&gt; includeFields, 
List&lt;String&gt; search, String[] languages,Integer 
limit)::Collection&lt;Representation&gt;_
 
-This method is used for searching entities in the controlled vocabulary. The 
configured _Label Field_ is parsed in the 'field' parameter. The 
'includedFileds' contain all fields required for the linking process. 
_Representation_s returned as result need to include values for those fields. 
The 'search' parameter includes the tokens used for the search. Values should 
be considered optional however Results are considered to rank Entities that 
match more search entires first.
+This method is used for searching entities in the controlled vocabulary. The 
configured _Label Field_ is passed in the 'field' parameter. The 
'includeFields' contain all fields required for the linking process. 
_Representation_s returned as results need to include values for those fields. 
The 'search' parameter contains the tokens used for the search. Individual 
tokens should be treated as optional; however, results are expected to rank 
Entities that match more search tokens first. The 'languages' array is used to 
pass the languages that need to be considered for the search. If 'languages' 
contains NULL or '' it means that labels without a language tag also need to be 
included in the search (NOTE that this DOES NOT mean to include labels of any 
language!). Finally, the 'limit' parameter specifies the maximum number 
of results. If it is NULL, the implementation can choose a meaningful default.
 
+* __Offline Mode__ _supportsOfflineMode()::boolean_ : Indicates whether the 
EntitySearcher implementation can be used without connecting to a remote 
service. This is needed to deactivate the EntityLinkingEngine in cases where 
Apache Stanbol is started in OfflineMode.
+* __Search Result Limit__ _getLimit()::Integer_ : The maximum number of search 
results supported by the EntitySearcher implementation. Can return NULL if not 
applicable or unknown.
+* __Origin Information__ 
_getOriginInformation()::Map&lt;UriRef,Collection&lt;Resource&gt;&gt;_ : This 
method returns information about the origin that is added to every 
'fise:EntityAnnotation' created by the EntityLinkingEngine. This is e.g. used 
by the Entityhub based implementations to provide the 'id' of the Entityhub Site 
the Entities were retrieved from. 
+
+The [EntityhubLinkingEngine](entityhublinking) includes EntitySearcher 
implementations based on the FieldQuery search interface implemented by the 
Stanbol Entityhub.
 
 Currently the StanbolEntityhub based implementations are instantiated based on 
the value of the 
_'org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId'_. 
Users that want to use a different implementation of this Interface to be used 
for linking will need to extend the KeywordLinkingEngine and override the 
#activateEntitySearcher(ComponentContext context, Dictionary<String,Object> 
configuration) and #deactivateEntitySearcher(). Those methods are called during 
activation/deactivation of the KeywordLinkingEngine and are expected to 
set/unset the #entitySearcher field.
 
 ### LabelTokenizer
 
-The LabelTokenizer interface is used to tokenize labels of Entities from the 
linked Vocabulary. As the matching process of the KeywordLinkingEngine is based 
on Tokens (words) multi-word labels (e.g. Univerity of Munich) need to be 
tokenized before they can be matched against the current context in the Text.
+The LabelTokenizer interface is used to tokenize labels of Entity suggestions 
as returned by the [EntitySearcher](#entitysearcher). As the matching process of 
the KeywordLinkingEngine is based on Tokens (words), multi-word labels (e.g. 
University of Munich) need to be tokenized before they can be matched against 
the current context in the Text.
+
+The _LabelTokenizer_ interface defines only the single _tokenize(String label, 
String language)::String[]_ method, which gets the label and the language as 
parameters and returns the tokens as a String array. If the tokenizer was not 
able to tokenize the label (e.g. because it does not support the language) it 
MUST return NULL. In this case the NamedEntityLinking engine will try to match 
the label as a single token.
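
The contract described above can be sketched with a minimal implementation. This is only an illustration: the `LabelTokenizer` interface is redeclared locally as a stand-in so the example compiles on its own, and `WhitespaceLabelTokenizer` is a hypothetical class that is not part of Stanbol.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Local stand-in for the Stanbol interface, declared here only so that
// the sketch is self-contained.
interface LabelTokenizer {
    String[] tokenize(String label, String language);
}

// Hypothetical tokenizer that splits labels on whitespace for a fixed
// set of supported languages.
class WhitespaceLabelTokenizer implements LabelTokenizer {
    private final Set<String> supported;

    WhitespaceLabelTokenizer(String... languages) {
        this.supported = new HashSet<>(Arrays.asList(languages));
    }

    @Override
    public String[] tokenize(String label, String language) {
        if (label == null || !supported.contains(language)) {
            // Unsupported input: return null so that the caller can
            // fall back to matching the label as a single token.
            return null;
        }
        String trimmed = label.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```

Returning NULL rather than an empty array for unsupported languages is what lets the caller tell "cannot tokenize" apart from "label has no tokens".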
+
+#### MainLabelTokenizer
+
+As users will very likely want to use multiple LabelTokenizers for different 
languages, the EntityLinkingEngine comes with a 
MainLabelTokenizer implementation. It registers itself as LabelTokenizer with 
the highest possible OSGI 'service.ranking' and tracks all other registered 
_LabelTokenizer_s.
+
+So if custom _LabelTokenizer_s register themselves as OSGI services, then the 
MainLabelTokenizer can forward requests to them. It will do so in the order of 
their '<code>service.ranking</code>'s. In addition, a _LabelTokenizer_ can use the 
'<code>enhancer.engines.keywordextraction.labeltokenizer.languages</code>' 
property to formally specify the languages it supports. This property 
uses the language configuration syntax (e.g. "en,de" would include English 
and German; "!it,!fr,*" would specify all languages except Italian and French). 
If no configuration is provided, then "*" (all languages) is assumed, which is 
fine as a default as long as _LabelTokenizer_s correctly return NULL for languages 
they do not support.
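
To make the language configuration syntax concrete, here is a small sketch of how such a value could be evaluated. `LanguageConfig` is a hypothetical helper written for this example, not the parser Stanbol uses; it only mirrors the three rules stated above (explicit includes, `!` excludes, and `*` as wildcard, with `*` assumed when nothing is configured).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper that evaluates values such as "en,de" or "!it,!fr,*".
class LanguageConfig {
    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private boolean wildcard = false;

    LanguageConfig(String config) {
        if (config == null || config.trim().isEmpty()) {
            wildcard = true; // no configuration: "*" (all languages) is assumed
            return;
        }
        for (String entry : config.split(",")) {
            String e = entry.trim();
            if (e.isEmpty()) {
                continue; // ignore empty entries such as in "en,,de"
            }
            if (e.equals("*")) {
                wildcard = true;
            } else if (e.startsWith("!")) {
                excluded.add(e.substring(1));
            } else {
                included.add(e);
            }
        }
    }

    // Explicit excludes always win; otherwise the language must be
    // listed explicitly or covered by the wildcard.
    boolean isLanguage(String language) {
        if (excluded.contains(language)) {
            return false;
        }
        return wildcard || included.contains(language);
    }
}
```

With this sketch, `new LanguageConfig("!it,!fr,*")` accepts every language except Italian and French, while `new LanguageConfig("en,de")` accepts only English and German.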
+
+The MainLabelTokenizer forwards tokenize requests to all available 
LabelTokenizer implementations that support a specific language sorted by their 
'<code>service.ranking</code>' until the first one does NOT return NULL. If no 
LabelTokenizer was found or all returned NULL it will also return NULL.
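
The forwarding behaviour just described can be sketched as follows. This is a simplified illustration, not the MainLabelTokenizer source: OSGI service tracking and the language filter are left out, the `LabelTokenizer` interface is redeclared as a local stand-in, and `DelegatingLabelTokenizer` and `RankedTokenizer` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Local stand-in for the Stanbol interface so the sketch compiles on its own.
interface LabelTokenizer {
    String[] tokenize(String label, String language);
}

// Hypothetical pairing of a tokenizer with its 'service.ranking'.
class RankedTokenizer {
    final int ranking;
    final LabelTokenizer tokenizer;

    RankedTokenizer(int ranking, LabelTokenizer tokenizer) {
        this.ranking = ranking;
        this.tokenizer = tokenizer;
    }
}

// Hypothetical delegating tokenizer: asks delegates in descending ranking
// order and returns the first non-null result.
class DelegatingLabelTokenizer implements LabelTokenizer {
    private final List<RankedTokenizer> delegates = new ArrayList<>();

    void register(int ranking, LabelTokenizer tokenizer) {
        delegates.add(new RankedTokenizer(ranking, tokenizer));
        // keep the list sorted so that the highest ranking comes first
        delegates.sort(
            Comparator.comparingInt((RankedTokenizer r) -> r.ranking).reversed());
    }

    @Override
    public String[] tokenize(String label, String language) {
        for (RankedTokenizer delegate : delegates) {
            String[] tokens = delegate.tokenizer.tokenize(label, language);
            if (tokens != null) {
                return tokens; // first delegate that can handle the label wins
            }
        }
        return null; // no delegate registered, or all of them returned null
    }
}
```

A low-ranked catch-all tokenizer registered with `-1000` is only consulted here when every higher-ranked delegate has returned NULL.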
+
+The following code snippet shows how to use the _MainLabelTokenizer_ as 
_LabelTokenizer_ for the _EntityLinkingEngine_:
+
+    :::java
+    @Reference
+    LabelTokenizer labelTokenizer;
+
+This will inject the MainLabelTokenizer as it uses 
<code>Integer.MAX_VALUE</code> as <code>service.ranking</code>.
+
+    :::java
+    @Activate
+    protected void activate(ComponentContext ctx){
+        //within the activate method it can then be used
+        //to initialize the NamedEntityLinkingEngine
+        NamedEntityLinkingEngine engine = new NamedEntityLinkingEngine(
+            engineName,
+            entitySearcher, //the searcher might not be available
+            textProcessingConfig, linkerConfig, //config
+            labelTokenizer); //the MainLabelTokenizer
+    }
+
+Configuring the NamedEntityLinkingEngine like this ensures that all registered 
_LabelTokenizer_s are considered for tokenizing.
+
+#### OpenNLP LabelTokenizer
+
+This is the default implementation of a LabelTokenizer based on the 
[OpenNLP](http://opennlp.apache.org) tokenizer API. Internally it uses the 
OpenNLP service to load tokenizer models for languages. If no language specific 
model is available it uses the OpenNLP SimpleTokenizer implementation. The 
_OpenNlpLabelTokenizer_ registers itself with a '<code>service.ranking</code>' 
of '-1000' so it will b
 
-LabelTokenizer are OSGI services. Their configuration optionally can define 
the _'enhancer.engines.keywordextraction.labeltokenizer.languages'_ property. 
Values are considered to be language configurations. Configurations can 
explicitly include/exclude languages. Also a wildcard is supported (e.g. 
"en,de" would include English and German; "!it,!fr,*" would specify all 
languages expect Italian and French. If no configuration is provided than "*" 
(all languages) is assumed.
+The _LabelTokenizerManager_ interface extends the _
 
 The KeywordLinkingEngine will - by default - always use the LabelTokenizer 
with the highest "service.ranking" for a given language to tokenize labels. By 
default it comes with an OpenNLP based Tokenizer implementation that registers 
itself for all languages with a "service.ranking" of "-1000".
 
-Users that want to use a different Tokenizer need to register an 
implementation for the given language(s) with an higher "service.ranking". 
Users that want to provide there own LabelTokenizer and ignore the values 
provided by OSGI need to extend the KeywordLinkingEngine set the 
#labelTokenizer field themself AND override the 
#bindLabelTokenizer(LabelTokenizerManager ltm) and 
#unbindLabelTokenizer(LabelTokenizerManager ltm) methods in a way that they do 
NOT change the #labelTokenizer field.

svn commit: r1412857 - in /stanbol/site/trunk/content/docs/trunk/components/enhancer/engines: entityhublinking.mdtext entityhublinking.png entitylinking.mdtext
