Author: rwesten
Date: Mon Jun 2 07:54:02 2014
New Revision: 1599105
URL: http://svn.apache.org/r1599105
Log:
updated the Configuring Entity Linking section of the customvocabulary usage
scenario
Modified:
stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
Modified: stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext?rev=1599105&r1=1599104&r2=1599105&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext (original)
+++ stanbol/site/trunk/content/docs/trunk/customvocabulary.mdtext Mon Jun 2
07:54:02 2014
@@ -15,7 +15,8 @@ The aim of this usage scenario is to pro
## Overview
The following figure shows the typical Enhancement workflow that may start
with some preprocessing steps (e.g. the conversion of rich text formats to
plain text) followed by the Natural Language Processing phase. Next 'Semantic
Lifting' aims to connect the results of text processing and link it to the
application domain of the user. During Postprocessing those results may get
further refined.
-<p style="text-align: center;"></p>
This usage scenario is all about the Semantic Lifting phase. This phase is
central to how well enhancement results match the requirements of
the user's application domain. Users who need to process health-related
documents will need to provide vocabularies containing life-science-related
entities; otherwise the Stanbol Enhancer will not perform as expected on those
documents. Similarly, processing customer requests can only work if Stanbol has
access to data managed by the CRM.
@@ -181,17 +182,21 @@ The following Example shows a [enhanceme
Both the [weighted chain](components/enhancer/chains/weightedchain.html) and
the [list chain](components/enhancer/chains/listchain.html) can be used for the
configuration of such a chain.
-### Configuring Named Entity Linking
+### Configuring Entity Linking
-First it is important to note the difference between _Named Entity Linking_
and _Entity Linking_. While _Named Entity Linking_ only considers _Named
Entities_ detected by NER (Named Entity Recognition) _Entity Linking_ does work
on Words (Tokens). Because of that is has much lower NLP requirements and can
even operate for languages where only word tokenization is supported. However
extraction results AND performance do greatly improve with POS (Part of Speech)
tagging support. Also Chunking (Noun Phrase detection), NER and Lemmatization
results can be consumed by Entity Linking to further improve extraction
results. For details see the documentation of the [Entity Linking
Process](components/enhancer/engines/entitylinking#linking-process).
+First it is important to note the difference between _Named Entity Linking_
and _Entity Linking_. While _Named Entity Linking_ only considers _Named
Entities_ detected by NER (Named Entity Recognition), _Entity Linking_ works
on Words (Tokens). As NER support is only available for a limited number of
languages, _Named Entity Linking_ is only an option for those languages. _Entity
Linking_ only requires correct tokenization of the text, so it can be used for
nearly every language. _NOTE_ however that POS (Part of Speech) tagging will
greatly improve quality and also speed, as it allows the engine to look up only
Nouns. Chunking (Noun Phrase detection), NER and Lemmatization results are also
considered by Entity Linking to improve vocabulary lookups. For details see the
documentation of the [Entity Linking
Process](components/enhancer/engines/entitylinking#linking-process).
The second big difference is that _Named Entity Linking_ can only support
Entity types supported by the NER models (Persons, Organizations and Places).
_Entity Linking_ does not have this restriction. This advantage comes with
the disadvantage that Entity Lookups against the Controlled Vocabulary are only
based on Label similarities. _Named Entity Linking_ also uses the type
information provided by NER.
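
The effect of POS tagging on label-based lookups can be pictured with a small toy sketch. It is not Stanbol code: the token list, the coarse POS tags, the `VOCABULARY` dictionary and the `link` function are all hypothetical and only illustrate why restricting lookups to Nouns reduces the number of vocabulary queries without changing the matches.

```python
# Toy illustration only, not Stanbol code. Tokens carry a coarse POS tag.
TOKENS = [
    ("The", "DET"), ("patient", "NOUN"), ("received", "VERB"),
    ("aspirin", "NOUN"), ("yesterday", "ADV"),
]

# Hypothetical vocabulary: lower-cased label -> entity URI.
VOCABULARY = {"aspirin": "http://example.org/drug/aspirin"}


def link(tokens, nouns_only):
    """Return (matches, number_of_lookups) for a purely label-based lookup."""
    matches, lookups = [], 0
    for word, pos in tokens:
        if nouns_only and pos != "NOUN":
            continue  # skipped without querying the vocabulary
        lookups += 1
        uri = VOCABULARY.get(word.lower())
        if uri is not None:
            matches.append((word, uri))
    return matches, lookups


all_matches, all_lookups = link(TOKENS, nouns_only=False)
noun_matches, noun_lookups = link(TOKENS, nouns_only=True)
print(all_lookups, noun_lookups)
```

With tokenization only, every token triggers a lookup; with POS tags available, only the two Nouns do, and the same entity is found in both cases.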
-To use _Entity Linking_ with a custom Vocabulary Users need to configure an
instance of the [Entityhub Linking
Engine](components/enhancer/engines/entityhublinking). While this Engine
provides more than twenty configuration parameters the following list provides
an overview about the most important. For detailed information please see the
documentation of the Engine.
+To use _Entity Linking_ with a custom Vocabulary, users need to configure an
instance of the [Entityhub Linking
Engine](components/enhancer/engines/entityhublinking) or an [FST Linking
Engine](components/enhancer/engines/lucenefstlinking). While both of those
Engines provide more than 20 configuration parameters, only very few of them are
required for a working configuration.
-1. The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" - where {name} is the name of the Entityhub Site
-2. The name of the "Managed- / Referenced Site" holding your vocabulary. Here
you have to configure the {name}
-3. The "Label Field" is the URI of the property in your vocabulary providing
the labels used for matching. You can only use a single field. If you want to
use values of several fields you have two options: (1) to adapt your indexing
configuration to copy the values of those fields to a single one (e.g. the
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in
the default configuration of the Entityhub indexing tool (see
{indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple
EntityubLinkingEngines - one for each label field. Option (1) is preferable as
long as you do not need to use different configurations for the different
labels.
+1. The "Name" of the enhancement engine. It is recommended to use something
like "{name}Extraction" or "{name}-linking" - where {name} is the name of the
Entityhub Site
+2. The link to the data source
+ * in case of the Entityhub Linking Engine this is the name of the
"Managed- / Referenced Site" holding your vocabulary - so if you followed this
scenario you need to configure the {name}
+ * in case of the FST linking engine this is the link to the SolrCore with
the index of your custom vocabulary. If you followed this scenario you need to
configure the {name} and set the field name encoding to "SolrYard".
+3. The configuration of the field used for linking
+ * in case of the Entityhub Linking Engine the "Label Field" needs to be
set to the URI of the property holding the labels. You can only use a single
field. If you want to use values of several fields you need to adapt your
indexing configuration to copy the values of those fields to a single one (e.g.
by adding `skos:prefLabel > rdfs:label` and `skos:altLabel > rdfs:label` to the
`{indexing-working-dir}/indexing/config/mappings.txt` config).
+ * in case of the FST Linking engine you need to provide the [FST Tagging
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration).
If you store your labels in the `rdfs:label` field and you want to support all
languages present in your vocabulary use `*;field=rdfs:label;generate=true`.
_NOTE_ that `generate=true` is required to allow the engine to (re)create FST
models at runtime.
4. The "Link ProperNouns only": If the custom Vocabulary contains Proper Nouns
(Named Entities) then this parameter should be activated. This option causes
the Entity Linking process to not make queries for common nouns, thereby
reducing the number of queries against the controlled vocabulary by ~70%.
However this is not feasible if the vocabulary contains Entities that are
common nouns in the language.
5. The "Type Mappings" might be interesting for you if your vocabulary
contains custom types as those mappings can be used to map 'rdf:type's of
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's -
created by the Apache Stanbol Enhancer to annotate occurrences of extracted
entities in the parsed text. See the [type mapping
syntax](components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax)
and the [usage scenario for the Apache Stanbol Enhancement
Structure](enhancementusage.html#entity-tagging-with-disambiguation-support)
for details.
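
To make the shape of the FST Tagging Configuration value from item 3 easier to read, here is a minimal sketch that splits a line such as `*;field=rdfs:label;generate=true` into its language part and its parameters. The function `parse_fst_config_line` is hypothetical and deliberately simplified; it is not the engine's own parser, and the linked FST Tagging Configuration documentation remains the reference for the full syntax.

```python
def parse_fst_config_line(line):
    """Split one configuration line into (language, {param: value}).

    Simplified sketch: the first ';'-separated part is taken as the
    language (or '*' for all languages), the rest as key=value pairs.
    """
    parts = [p.strip() for p in line.split(";") if p.strip()]
    language, params = parts[0], {}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return language, params


language, params = parse_fst_config_line("*;field=rdfs:label;generate=true")
print(language, params)
```

Read this way, the example line from the text says: for all languages (`*`), use the labels stored in `rdfs:label`, and allow FST models to be (re)created at runtime.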
@@ -202,7 +207,7 @@ The following Example shows an Example o
* opennlp-token - [OpenNLP based Word
tokenization](components/enhancer/engines/opennlptokenizer). Works for all
languages where white spaces can be used to tokenize.
* opennlp-pos - [OpenNLP Part of Speech
tagging](components/enhancer/engines/opennlppos)
* opennlp-chunker - The [OpenNLP
chunker](components/enhancer/engines/opennlpchunker) provides Noun Phrases
-* "{name}Extraction - the [Entityhub Linking
Engine](components/enhancer/engines/entityhublinking) configured for the custom
vocabulary.
+* "{name}Extraction" - the [Entityhub Linking
Engine](components/enhancer/engines/entityhublinking) or [FST Tagging
Configuration](components/enhancer/engines/lucenefstlinking#fst-tagging-configuration)
configured for the custom vocabulary.
Both the [weighted chain](components/enhancer/chains/weightedchain.html) and
the [list chain](components/enhancer/chains/listchain.html) can be used for the
configuration of such a chain.
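
Once such a chain is configured, one way to try it out is to send plain text to the Enhancer over HTTP. The sketch below only builds the request and does not send it. The base URL `http://localhost:8080`, the chain name `myvocab-chain`, the `/enhancer/chain/{chain}` path and the helper `build_enhance_request` are assumptions for illustration; adjust them to your own installation.

```python
import urllib.request


def build_enhance_request(text, chain, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for an enhancement chain.

    Assumption: the chain is reachable at {base_url}/enhancer/chain/{chain}.
    """
    url = f"{base_url}/enhancer/chain/{chain}"
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain; charset=UTF-8",
            "Accept": "application/json",
        },
        method="POST",
    )


request = build_enhance_request("Paris is the capital of France.", "myvocab-chain")
print(request.get_method(), request.full_url)

# To send it against a running instance (not executed here):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```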