customvocabulary.mdtext

rwesten Mon, 02 Jun 2014 00:55:19 -0700

Author: rwesten
Date: Mon Jun  2 07:54:02 2014
New Revision: 1599105

URL: http://svn.apache.org/r1599105
Log:
updated the Configuring Entity Linking section of the customvocabulary usage 
scenario


Modified:
    stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext

Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1599105&r1=1599104&r2=1599105&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Mon Jun  2 
07:54:02 2014
@@ -15,7 +15,8 @@ The aim of this usage scenario is to pro
 ## Overview
 
 The following figure shows the typical Enhancement workflow that may start 
with some preprocessing steps (e.g. the conversion of rich text formats to 
plain text) followed by the Natural Language Processing phase. Next 'Semantic 
Lifting' aims to connect the results of text processing and link it to the 
application domain of the user. During Postprocessing those results may get 
further refined.
-<p style="text-align: center;">![Typical Enhancement 
Workflow](enhancementworkflow.png "The typical Enhancement Chain includes the 
+
+<p style="text-align: center;">![Typical Enhancement 
Workflow](enhancementworkflow.png)</p>
 
 This usage scenario is all about the Semantic Lifting phase. This phase is 
most central to how well enhancement results match the requirements of 
the user's application domain. Users who need to process health-related 
documents will need to provide vocabularies containing life-science-related 
entities; otherwise the Stanbol Enhancer will not perform as expected on those 
documents. Similarly, processing customer requests can only work if Stanbol has 
access to data managed by the CRM.
 
@@ -181,17 +182,21 @@ The following Example shows a [enhanceme
 
 Both the [weighted chain](components/enhancer/chains/weightedchain.html) and 
the [list chain](components/enhancer/chains/listchain.html) can be used for the 
configuration of such a chain.
 
-### Configuring Named Entity Linking
+### Configuring Entity Linking
 
-First it is important to note the difference between _Named Entity Linking_ 
and _Entity Linking_. While _Named Entity Linking_ only considers _Named 
Entities_ detected by NER (Named Entity Recognition) _Entity Linking_ does work 
on Words (Tokens). Because of that is has much lower NLP requirements and can 
even operate for languages where only word tokenization is supported. However 
extraction results AND performance do greatly improve with POS (Part of Speech) 
tagging support. Also Chunking (Noun Phrase detection), NER and Lemmatization 
results can be consumed by Entity Linking to further improve extraction 
results. For details see the documentation of the [Entity Linking 
Process](components/enhancer/engines/entitylinking#linking-process).
+First it is important to note the difference between _Named Entity Linking_ 
and _Entity Linking_. While _Named Entity Linking_ only considers _Named 
Entities_ detected by NER (Named Entity Recognition), _Entity Linking_ works 
on words (tokens). As NER support is only available for a limited number of 
languages, _Named Entity Linking_ is only an option for those languages. _Entity 
Linking_ only requires correct tokenization of the text, so it can be used for 
nearly every language. However, _NOTE_ that POS (Part of Speech) tagging will 
greatly improve quality and also speed, as it allows the engine to look up only 
nouns. Chunking (Noun Phrase detection), NER and Lemmatization results are also 
considered by Entity Linking to improve vocabulary lookups. For details see the 
documentation of the [Entity Linking 
Process](components/enhancer/engines/entitylinking#linking-process).
 
 The second big difference is that _Named Entity Linking_ can only support 
Entity types supported by the NER models (Persons, Organizations and Places). 
_Entity Linking_ does not have this restriction. This advantage also comes with 
the disadvantage that Entity Lookups to the Controlled Vocabulary are only 
based on Label similarities. _Named Entity Linking_ also uses the type 
information provided by NER.
 
-To use _Entity Linking_ with a custom Vocabulary Users need to configure an 
instance of the [Entityhub Linking 
Engine](components/enhancer/engines/entityhublinking). While this Engine 
provides more than twenty configuration parameters the following list provides 
an overview about the most important. For detailed information please see the 
documentation of the Engine.
+To use _Entity Linking_ with a custom Vocabulary, Users need to configure an 
instance of the [Entityhub Linking 
Engine](components/enhancer/engines/entityhublinking) or an [FST Linking 
engine](components/enhancer/engines/lucenefstlinking). While both of those 
Engines provide 20+ configuration parameters, only very few of them are 
required for a working configuration.
 
-1. The "Name" of the enhancement engine. It is recommended to use something 
like "{name}Extraction" - where {name} is the name of the Entityhub Site
-2. The name of the "Managed- / Referenced Site" holding your vocabulary. Here 
you have to configure the {name}
-3. The "Label Field" is the URI of the property in your vocabulary providing 
the labels used for matching. You can only use a single field. If you want to 
use values of several fields you have two options: (1) to adapt your indexing 
configuration to copy the values of those fields to a single one (e.g. the 
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in 
the default configuration of the Entityhub indexing tool (see 
{indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple 
EntityubLinkingEngines - one for each label field. Option (1) is preferable as 
long as you do not need to use different configurations for the different 
labels.
+1. The "Name" of the enhancement engine. It is recommended to use something 
like "{name}Extraction" or "{name}-linking" - where {name} is the name of the 
Entityhub Site
+2. The link to the data source
+    * in case of the Entityhub Linking Engine this is the name of the 
"Managed- / Referenced Site" holding your vocabulary - so if you followed this 
scenario you need to configure the {name}
+    * in case of the FST linking engine this is the link to the SolrCore with 
the index of your custom vocabulary. If you followed this scenario you need to 
configure the {name} and set the field name encoding to "SolrYard".
+3. The configuration of the field used for linking
+    * in case of the Entityhub Linking Engine the "Label Field" needs to be 
set to the URI of the property holding the labels. You can only use a single 
field. If you want to use values of several fields you need to adapt your 
indexing configuration to copy the values of those fields to a single one (e.g. 
by adding `skos:prefLabel > rdfs:label` and `skos:altLabel > rdfs:label` to the 
`{indexing-working-dir}/indexing/config/mappings.txt` config).
+    * in case of the FST Linking engine you need to provide the [FST Tagging 
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration).
 If you store your labels in the `rdfs:label` field and you want to support all 
languages present in your vocabulary use `*;field=rdfs:label;generate=true`. 
_NOTE_ that `generate=true` is required to allow the engine to (re)create FST 
models at runtime.
 4. The "Link ProperNouns only": If the custom Vocabulary contains Proper Nouns 
(Named Entities) then this parameter should be activated. This option causes 
the Entity Linking process to not make queries for common nouns, thereby 
reducing the number of queries against the controlled vocabulary by ~70%. 
However this is not feasible if the vocabulary does contain Entities that are 
common nouns in the language.
 5. The "Type Mappings" might be interesting for you if your vocabulary 
contains custom types as those mappings can be used to map 'rdf:type's of 
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - 
created by the Apache Stanbol Enhancer to annotate occurrences of extracted 
entities in the parsed text. See the [type mapping 
syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax)
 and the [usage scenario for the Apache Stanbol Enhancement 
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support) 
for details.
 
@@ -202,7 +207,7 @@ The following Example shows an Example o
 * opennlp-token - [OpenNLP based Word 
tokenization](components/enhancer/engines/opennlptokenizer). Works for all 
languages where white spaces can be used to tokenize.
 * opennlp-pos - [OpenNLP Part of Speech 
tagging](components/enhancer/engines/opennlppos)
 * opennlp-chunker - The [OpenNLP 
chunker](components/enhancer/engines/opennlpchunker) provides Noun Phrases
-* "{name}Extraction - the [Entityhub Linking 
Engine](components/enhancer/engines/entityhublinking) configured for the custom 
vocabulary.
+* "{name}Extraction" - the [Entityhub Linking 
Engine](components/enhancer/engines/entityhublinking) or [FST Tagging 
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration)
 configured for the custom vocabulary.
 
 Both the [weighted chain](components/enhancer/chains/weightedchain.html) and 
the [list chain](components/enhancer/chains/listchain.html) can be used for the 
configuration of such a chain.
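
The "Link ProperNouns only" option described in the diff above restricts vocabulary lookups to proper nouns. The following minimal sketch illustrates the idea on a toy sentence. It is not Stanbol code: the function `lookup_candidates`, the hand-tagged token list and the use of Penn Treebank style POS tags are all assumptions made for this example.

```python
# Hypothetical illustration only; not part of Apache Stanbol.
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn Treebank style tags assumed

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # common and proper nouns
PROPER_NOUN_TAGS = {"NNP", "NNPS"}         # proper nouns only


def lookup_candidates(tokens: List[Token], proper_nouns_only: bool) -> List[str]:
    """Return the words that would be looked up in the vocabulary."""
    allowed = PROPER_NOUN_TAGS if proper_nouns_only else NOUN_TAGS
    return [word for word, tag in tokens if tag in allowed]


# Hand-tagged example sentence (the tags are an assumption, not tagger output).
tokens = [
    ("The", "DT"), ("doctor", "NN"), ("in", "IN"), ("Paris", "NNP"),
    ("prescribed", "VBD"), ("aspirin", "NN"), ("to", "TO"), ("Alice", "NNP"),
]

all_nouns = lookup_candidates(tokens, proper_nouns_only=False)
proper_only = lookup_candidates(tokens, proper_nouns_only=True)

print(all_nouns)    # ['doctor', 'Paris', 'aspirin', 'Alice']
print(proper_only)  # ['Paris', 'Alice']
```

In this toy case the proper-noun filter halves the number of lookups, but it also skips "aspirin". A vocabulary whose entities are common nouns (drug names, for instance) would lose matches, which is why the option is not feasible for such vocabularies.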

svn commit: r1599105 - /stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
