svn commit: r858939 - in /websites/staging/stanbol/trunk/content: ./ docs/trunk/ docs/trunk/components/enhancer/ docs/trunk/components/enhancer/engines/

buildbot Thu, 18 Apr 2013 05:20:37 -0700

Author: buildbot
Date: Thu Apr 18 12:20:11 2013
New Revision: 858939

Log:
Staging update by buildbot for stanbol


Added:
    websites/staging/stanbol/trunk/content/docs/trunk/enhancementworkflow.png   
(with props)
Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
    
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.html
    
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/enhancementstructure.html
    websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Apr 18 12:20:11 2013
@@ -1 +1 @@
-1467194
+1469293

Modified: 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/entitylinking.html
 Thu Apr 18 12:20:11 2013
@@ -276,6 +276,8 @@ Configuration wise this will pre-set the
 <p><strong>Min Text Score</strong> 
<em>(enhancer.engines.linking.minTextScore)</em> [0..1]::double: The "Text 
Score" [0..1] represents how well the Label of an Entity matches the 
selected Span in the Text. It compares the number of matched Tokens from 
the label with the number of Tokens enclosed by the Span in the Text an Entity 
is suggested for. Inexact Token matches, or Tokens of the label that 
appear in a different order than in the text, also reduce this score. 
Entities are only considered if at least one of their labels scores higher than 
the minimum for all three of <em>Min Label Score</em>, <em>Min Text Match 
Score</em> and <em>Min Match Score</em>.</p>
 </li>
 <li><strong>Min Match Score</strong> 
<em>(enhancer.engines.linking.minMatchScore)</em> [0..1]::double: Defined as 
the product of the "Text Score" with the "Label Score" - meaning that this 
value represents both how well the label matches the text and how much of the 
label is matched with the text. Entities are only considered if at least one of 
their labels scores higher than the minimum for all three of <em>Min Label 
Score</em>, <em>Min Text Match Score</em> and <em>Min Match Score</em>. </li>
+<li><strong>Use EntityRankings</strong> 
<em>(enhancer.engines.linking.useEntityRankings)</em> ::boolean (default=true): 
Entity Rankings can be used to define the ranking (popularity, importance, 
connectivity, ...) of an entity relative to others within the knowledge base. 
The fise:confidence values calculated by the EntityLinkingEngine only 
represent how well a label of the entity matches the given section of the 
processed text, so for many use cases it makes sense to sort Entities with the 
same score based on their entity rankings (e.g. users would expect to get 
"Paris (France)" suggested before "Paris (Texas)" for Paris appearing in a 
text). Enabling this feature will slightly (&lt; 0.1) change the score of 
suggestions to ensure such an ordering.   <br />
+</li>
 </ul>
 <h4 id="type-mappings-syntax">Type Mappings Syntax</h4>
 <p>The Type Mappings are used to determine the "dc:type" of the <a 
href="../enhancementstructure.html#fisetextannotation">TextAnnotation</a> based 
on the types of the suggested Entity. The field "Type Mappings" (property: 
<em>org.apache.stanbol.enhancer.engines.keywordextraction.typeMappings</em>) 
can be used to customize such mappings.</p>

Modified: 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/engines/keywordlinkingengine.html
 Thu Apr 18 12:20:11 2013
@@ -102,7 +102,7 @@
 <ul>
 <li><strong>Name</strong> <em>(stanbol.enhancer.engine.name)</em>: The name of 
the Enhancement Engine. This name is used to refer to an <a 
href="index.html">EnhancementEngine</a> in <a 
href="enhancementchain.html">EnhancementChain</a>s</li>
 <li><strong>Referenced Site</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.referencedSiteId)</em>:
 The name of the ReferencedSite of the Stanbol Entityhub that holds the 
controlled vocabulary to be used for extracting Entities. "entityhub" or 
"local" can be used to extract Entities managed directly by the Entityhub.</li>
-<li><strong>Label Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)</em>: The 
name of the property used to look up Entities. Only a single field is supported 
for performance reasons. Users that want to use values of several fields should 
collect such values through a corresponding configuration in the mappings.txt used 
during indexing. This <a href="../../customvocabulary.html">usage scenario</a> 
provides more information on this.</li>
+<li><strong>Label Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.nameField)</em>: The 
name of the property used to look up Entities. Only a single field is supported 
for performance reasons. Users that want to use values of several fields should 
collect such values through a corresponding configuration in the mappings.txt used 
during indexing. This <a href="../../../customvocabulary.html">usage 
scenario</a> provides more information on this.</li>
 <li><strong>Case Sensitivity</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.caseSensitive)</em>: 
Activates or deactivates case sensitive matching. It is important to 
understand that even with case sensitivity activated an Entity with a label 
such as "Anaconda" will be suggested for the mention of "anaconda" in the text. 
The main difference will be the confidence value of such a suggestion, as with 
case sensitivity activated the starting letters "A" and "a" are NOT considered 
to be matching. See the second technical part for details about the matching 
process. Case Sensitivity is deactivated by default. It is recommended to 
activate it if controlled vocabularies contain abbreviations similar to commonly 
used words, e.g. CAN for Canada.</li>
 <li><strong>Type Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.typeField)</em>: 
Values of this field are used as values of the "fise:entity-types" property of 
created "<a 
href="../enhancementstructure.html#fiseentityannotation">fise:EntityAnnotation</a>"s.
 The default is "rdf:type".</li>
 <li><strong>Redirect Field</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectField)</em> 
and <strong>Redirect Mode</strong> 
<em>(org.apache.stanbol.enhancer.engines.keywordextraction.redirectMode)</em>: 
Redirects tell the KeywordLinkingEngine to follow a specific property 
in the knowledge base for matched entities. This feature allows, for example, following 
redirects from "USA" to "United States" as defined in Wikipedia. See 
"Processing of Entity Suggestions" for details. Possible values for the 
Redirect Mode are "IGNORE" - deactivates this feature; "ADD_VALUES" - uses 
label and type information of redirected entities, but keeps the URI of the 
extracted entity; "FOLLOW" - follows the redirect</li>

Modified: 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/enhancementstructure.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/enhancementstructure.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/enhancementstructure.html
 Thu Apr 18 12:20:11 2013
@@ -101,7 +101,7 @@
 </ul>
 </li>
 </ol>
-<p>While this document focuses on the first Engine and provides details on how 
the Stanbol Enhancement Structure is the integral part of the Stanbol Enhancer, 
there is also a <a href="../enhancementusage.html">Usage Scenario</a> available 
that focuses on how the Enhancements can be consumed by Stanbol Enhancer 
users.</p>
+<p>While this document focuses on the first Engine and provides details on how 
the Stanbol Enhancement Structure is the integral part of the Stanbol Enhancer, 
there is also a <a href="../../enhancementusage.html">Usage Scenario</a> 
available that focuses on how the Enhancements can be consumed by Stanbol 
Enhancer users.</p>
 <h2 id="overview-on-the-stanbol-enhancement-structure">Overview on the Stanbol 
Enhancement Structure</h2>
 <p>The Stanbol Enhancement Structure is a central part of the <a 
href="index.html">Stanbol Enhancer</a> architecture as it represents the 
binding element between the <a href="contentitem.html">ContentItem</a> analyzed 
by the <a href="engines">EnhancementEngine</a>s as configured by an <a 
href="chains">EnhancementChain</a>. Together with the <a 
href="contentitem.html#content-parts">ContentParts</a> it represents the state 
that is constantly updated during the enhancement process.</p>
 <p>The following graphic provides an overview on how the EnhancementStructure 
is used by the Stanbol Enhancer to formally represent the enhancement 
results.</p>

Modified: 
websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html
==============================================================================
--- websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html 
(original)
+++ websites/staging/stanbol/trunk/content/docs/trunk/customvocabulary.html Thu 
Apr 18 12:20:11 2013
@@ -93,22 +93,54 @@
 <p style="text-align: center;"><img alt="Customizing Stanbol for specific 
Domains" src="enhancer-custom-vocabularies.png" title="The Stanbol Enhancer 
extracts Entities based on domain specific Vocabularies indexed and imported to 
the Stanbol Entityhub." /></p></p>
 <p>The aim of this usage scenario is to provide Apache Stanbol users with all 
the required knowledge to customize Apache Stanbol to be used in their specific 
domain. This includes</p>
 <ul>
-<li>Index custom Vocabularies using the Entityhub Indexing Tool</li>
+<li>Two possibilities to manage custom Vocabularies<ol>
+<li>via the RESTful interface provided by a Managed Site or<br />
+</li>
+<li>by using a ReferencedSite with a full local index</li>
+</ol>
+</li>
+<li>Building full local indexes with the Entityhub Indexing Tool</li>
 <li>Importing Indexes to Apache Stanbol</li>
 <li>Configuring the Stanbol Enhancer to make use of the indexed and imported 
Vocabularies</li>
 </ul>
 <h2 id="overview">Overview</h2>
-<p>For text enhancement and linking to external sources, the Entityhub 
component of Apache Stanbol allows to work with local indexes of datasets. This 
has several advantages. </p>
-<ul>
-<li>You do not rely on internet connectivity, thus it is possible to operate 
offline with a huge set of entities.</li>
-<li>You can do local updates of these datasets.</li>
-<li>You can work with local resources, such as your LDAP directory or a 
specific and private enterprise vocabulary of a specific domain.</li>
-</ul>
-<p>Creating your own indexes is the preferred way of working with custom 
vocabularies. Small vocabularies can also be uploaded to the Entityhub as 
ontologies, directly. A downside to this approach is that only one ontology per 
installation is supported.</p>
-<p>If you want to use multiple datasets in parallel, you have to create a 
local index for these datasets and configure the Entityhub to use them. In the 
following we will focus on the main case, which is: Creating and using a 
local <a href="http://lucene.apache.org/solr/">Apache Solr</a> index of a 
custom vocabulary, e.g. a SKOS thesaurus or taxonomy of your domain.</p>
-<h2 id="creating-and-working-with-custom-local-indexes">Creating and working 
with custom local indexes</h2>
-<p>Apache Stanbol provides the machinery to start with vocabularies in 
standard languages such as <a href="http://www.w3.org/2004/02/skos/">SKOS</a> 
or <a href="http://www.w3.org/TR/rdf-primer/">RDF</a> encoded data sets. The 
Apache Stanbol components needed for this functionality are the 
Entityhub and its indexing tool for creating and managing the index and <a 
href="components/enhancer/engines">enhancement engines</a> that make use of the 
indexes during the enhancement process.</p>
-<p>To create and import your own vocabulary to the Apache Stanbol Entityhub 
you need to follow the following Steps</p>
+<p>The following figure shows the typical Enhancement workflow. It may start 
with some preprocessing steps (e.g. the conversion of rich text formats to 
plain text), followed by the Natural Language Processing phase. Next, 'Semantic 
Lifting' aims to connect the results of text processing to the 
application domain of the user. During Postprocessing those results may get 
further refined.</p>
+<p style="text-align: center;"><img alt="Typical Enhancement Workflow" 
src="enhancementworkflow.png" title="The typical Enhancement Chain includes the" 
/></p>
+<p>This usage scenario is all about the Semantic Lifting phase. This phase 
largely determines how well enhancement results match the requirements of 
the user's application domain. Users that need to process health related 
documents will need to provide vocabularies containing life science related 
entities, otherwise the Stanbol Enhancer will not perform as expected on those 
documents. Similarly, processing customer requests can only work if Stanbol has 
access to data managed by the CRM.</p>
+<p>This scenario aims to provide Stanbol users with all information necessary 
to use Apache Stanbol in scenarios where domain specific vocabularies are 
required.<br />
+</p>
+<h2 id="managing-custom-vocabularies-with-the-stanbol-entityhub">Managing 
Custom Vocabularies with the Stanbol Entityhub</h2>
+<p>By default the Stanbol Enhancer uses the Entityhub component for 
linking Entities with mentions in the processed text. While users may extend 
the Enhancer to allow the usage of other sources, this is outside the scope 
of this scenario.</p>
+<p>The Stanbol Entityhub provides two possibilities to manage vocabularies</p>
+<ol>
+<li><strong><a href="components/entityhub/managedsite">Managed 
Sites</a></strong>: A fully readable and writable storage for Entities. Once created, 
users can use a RESTful interface to create, update, retrieve, query and delete 
entities.</li>
+<li><strong>Referenced Site</strong>: A read-only version of a Site that can 
either be used as a local cache of remotely managed data (such as a <a 
href="http://linkeddata.org/">Linked Data</a> server) or use a fully local 
index of the knowledge base - the relevant case in the context of this 
scenario.</li>
+</ol>
+<p>As a rule of thumb users should prefer to use a <strong>Managed 
Site</strong> if the vocabulary changes regularly and those changes need to 
be reflected in enhancement results of processed documents. A 
<strong>Referenced Site</strong> is typically the better choice for 
vocabularies that do not change on a regular basis and/or for users that want to 
apply advanced rules while indexing a dataset.</p>
+<h3 id="using-a-entityhub-managed-site">Using an Entityhub Managed Site</h3>
+<p>How to use a Managed Site is already described in detail by the <a 
href="components/entityhub/managedsite">Documentation of Managed Sites</a>. To 
configure a new Managed Site on the Entityhub users need to create two 
components:</p>
+<ol>
+<li>the <em>Yard</em> - the storage component of the Stanbol Entityhub. While 
there are multiple Yard implementations, when used for EntityLinking the <a 
href="components/entityhub/managedsite#configuration-of-a-solryard">SolrYard 
implementation</a> should be used.</li>
+<li>the <em><a 
href="components/entityhub/managedsite#configuration-of-the-yardsite">YardSite</a></em>
 - the component that implements the ManagedSite interface.</li>
+</ol>
+<p>After completing those two steps an empty Managed Site should be ready to 
use and available under</p>
+<div class="codehilite"><pre><span class="n">http:</span><span 
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span 
class="o">-</span><span class="n">host</span><span class="p">}</span><span 
class="sr">/entityhub/si</span><span class="n">te</span><span 
class="sr">/{managed-site-name}/</span>
+</pre></div>
+
+
+<p>and users can start to upload the Entities of the controlled vocabulary by 
using the RESTful interface such as</p>
+<div class="codehilite"><pre><span class="n">curl</span> <span 
class="o">-</span><span class="n">i</span> <span class="o">-</span><span 
class="n">X</span> <span class="n">PUT</span> <span class="o">-</span><span 
class="n">H</span> <span class="s">&quot;Content-Type: 
application/rdf+xml&quot;</span> <span class="o">-</span><span 
class="n">T</span> <span class="p">{</span><span class="n">rdf</span><span 
class="o">-</span><span class="n">xml</span><span class="o">-</span><span 
class="n">data</span><span class="p">}</span> <span class="o">\</span>
+    <span 
class="s">&quot;http://{stanbol-host}/entityhub/site/{managed-site-name}/entity&quot;</span>
+</pre></div>
+
+
+<p>In case you have opted to use a <em>Managed Site</em> for managing your 
entities you can now skip ahead to the section 'Configuring the Stanbol 
Enhancer for your custom Vocabularies'.</p>
+<h3 id="using-a-entityhub-referenced-site">Using an Entityhub Referenced 
Site</h3>
+<p>Referenced Sites are used by the Stanbol Entityhub to reference external 
knowledge bases. This can be done by configuring remote services for 
dereferencing and querying information, but also by providing a full local 
index of the referenced knowledge base. </p>
+<p>When using a Referenced Site in combination with the Stanbol Enhancer it is 
highly recommended for performance considerations to provide a full local 
index. To create such local indexes Stanbol provides the <em>Entityhub Indexing 
Tool</em>. See the following section for detailed information on how to use 
this tool.</p>
+<h2 id="building-full-local-indexes-with-the-entityhub-indexing-tool">Building 
full local indexes with the Entityhub Indexing Tool</h2>
+<p>The Entityhub Indexing Tool creates full local indexes of 
knowledge bases that can be loaded to the Stanbol Entityhub as Referenced 
Sites. Users of Managed Sites may want to skip this section.</p>
+<p>Users of the Entityhub Indexing Tool will typically need to complete the 
steps described in the following subsections.</p>
 <h3 id="step-1-compile-and-assemble-the-indexing-tool">Step 1 : Compile and 
assemble the indexing tool</h3>
 <p>The indexing tool provides a default configuration for creating an <a 
href="http://lucene.apache.org/solr/">Apache Solr</a> index of RDF files (e.g. 
a SKOS export of a thesaurus or a set of foaf files).</p>
 <p>To build the indexing tool from source - recommended - you will need to 
check out Apache Stanbol from SVN (or <a href="../../downloads">download</a> a 
source-release). Instructions for this can be found <a 
href="tutorial.html">here</a>. However if you want to skip this you can also 
obtain a <a 
href="http://dev.iks-project.eu/downloads/stanbol-launchers/">binary 
version</a> from the IKS development server (search the sub-folders of the 
different versions for a file named like 
"<code>org.apache.stanbol.entityhub.indexing.genericrdf-*-jar-with-dependencies.jar</code>").</p>
@@ -142,7 +174,9 @@ org.apache.stanbol.entityhub.indexing.ge
 <p>After the initialization you will need to provide the following 
configurations in files located in the configuration folder 
(<code>{indexing-working-dir}/indexing/config</code>)</p>
 <ul>
 <li>Within the <code>indexing.properties</code> file you need to set the 
{name} of your index by changing the value of the "name" property. In addition 
you should also provide a "description". At the end of the indexing.properties 
file you can also specify the license and attribution for the data you index. 
The Apache Entityhub will ensure that this information is included with 
any entity data returned for requests.</li>
-<li>If the data you index use uncommon namespaces you will need to 
add those to the <code>mappings.txt</code> file (here is an <a 
href="examples/anl-mappings.txt">example</a>  including default and specific 
mappings for one dataset)</li>
+<li>Optionally, if your data use namespaces that are not present in <a 
href="http://prefix.cc">prefix.cc</a> (or the server used for indexing does not 
have internet connectivity) you can manually define the required prefixes by 
creating or editing an <code>indexing/config/namespaceprefix.mappings</code> 
file. The syntax is '<code>{prefix}\t{namespace}\n</code>' where 
'<code>{prefix} ... [0..9A..Za..z-_]</code>' and '<code>{namespace} ... must 
end with '#' or '/' for URLs and ':' for URNs</code>'.</li>
+<li>Optionally, if the data you index use uncommon namespaces you 
will need to add those to the <code>mappings.txt</code> file (here is an <a 
href="examples/anl-mappings.txt">example</a>  including default and specific 
mappings for one dataset)</li>
+<li>Optionally, if you want to use a custom SolrCore configuration, the core 
configuration needs to be copied to 
<code>indexing/config/{core-name}</code>. A default configuration to start from 
can be downloaded from the <a 
href="https://svn.apache.org/repos/asf/stanbol/trunk/entityhub/yard/solr/src/main/resources/solr/core/">Stanbol
 SVN</a> and extracted to the <code>indexing/config/</code> folder. If the 
{core-name} is different from the 'name' configured in the 
<code>indexing.properties</code> then the '<code>solrConf</code>' parameter of 
the '<code>indexingDestination</code>' MUST be set to 
'<code>solrConf:{core-name}</code>'. After those configurations users can make 
custom adaptations to the SolrCore configuration used for indexing. </li>
 </ul>
 <p>Finally you will also need to copy your source files into the source 
directory <code>{indexing-working-dir}/indexing/resources/rdfdata</code>. All 
files within this directory will be indexed. The indexing tool supports most 
common RDF serializations. You can also directly index compressed RDF files.</p>
 <p>For more details about possible configurations, please consult the <a 
href="https://github.com/apache/stanbol/blob/trunk/entityhub/indexing/genericrdf/README.md">README</a>.</p>
@@ -152,13 +186,13 @@ org.apache.stanbol.entityhub.indexing.ge
 </pre></div>
 
 
-<p>Depending on your hardware and on complexity and size of your sources, it 
may take several hours to build the index. As a result, you will get an archive 
of an <a href="http://lucene.apache.org/solr/">Apache Solr</a> index together 
with an OSGI bundle to work with the index in Stanbol.</p>
+<p>Depending on your hardware and on complexity and size of your sources, it 
may take several hours to build the index. As a result, you will get an archive 
of an <a href="http://lucene.apache.org/solr/">Apache Solr</a> index together 
with an OSGI bundle to work with the index in Stanbol. Both files will be 
located within the <code>indexing/dist</code> folder.</p>
 <p><em>IMPORTANT NOTES:</em> </p>
 <ul>
 <li>
 <p>The import of the RDF files to the Jena TDB triple store - used as source 
for the indexing - takes a lot of time. Because of that imported data are 
reused for multiple runs of the indexing tool. This has two important effects 
users need to be aware of:</p>
 <ol>
-<li>Already imported RDF files should be removed from the 
<code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid 
re-importing them on every run of the tool</li>
+<li>Already imported RDF files should be removed from the 
<code>{indexing-working-dir}/indexing/resources/rdfdata</code> to avoid 
re-importing them on every run of the tool. NOTE: newer versions of the Entityhub 
indexing tool might automatically move successfully imported RDF files to a 
different folder.</li>
 <li>If the RDF data change you will need to delete the Jena TDB store so that 
those changes are reflected in the created index. To do this delete the 
<code>{indexing-working-dir}/indexing/resources/tdb</code> folder</li>
 </ol>
 </li>
@@ -169,7 +203,7 @@ org.apache.stanbol.entityhub.indexing.ge
 <h3 id="step-3-initialize-the-index-within-apache-stanbol">Step 3 : Initialize 
the index within Apache Stanbol</h3>
 <p>We assume that you already have a running Apache Stanbol instance at 
http://{stanbol-host} and that {stanbol-working-dir} is the working directory 
of that instance on the local hard disk. To install the created index you need 
to </p>
 <ul>
-<li>copy the "{name}.solrindex.zip" file to the 
<code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run 
the 0.9.0-incubating version the path is 
<code>{stanbol-working-dir}/sling/datafiles</code>.</li>
+<li>copy the "{name}.solrindex.zip" file to the 
<code>{stanbol-working-dir}/stanbol/datafiles</code> directory (NOTE if you run 
the 0.9.0-incubating version the path is 
<code>{stanbol-working-dir}/sling/datafiles</code>).</li>
 <li>install the <code>org.apache.stanbol.data.site.{name}-{version}.jar</code> 
to the OSGI environment of your Stanbol instance e.g. by using the Bundle tab 
of the Apache Felix web console at 
<code>http://{stanbol-host}/system/console/bundles</code></li>
 </ul>
 <p>You find both files in the 
<code>{indexing-working-dir}/indexing/dist/</code> folder.</p>
@@ -179,7 +213,7 @@ org.apache.stanbol.entityhub.indexing.ge
 
 
 <p>You can use the Web UI of the Stanbol Enhancer to explore your vocabulary. 
Note that for large vocabularies it might take some time until the site 
becomes functional.</p>
-<h2 id="b-configure-and-use-the-index-with-the-apache-stanbol-enhancer">B. 
Configure and use the index with the Apache Stanbol Enhancer</h2>
+<h2 
id="configuring-the-stanbol-enhancer-for-your-custom-vocabularies">Configuring 
the Stanbol Enhancer for your custom Vocabularies</h2>
 <p>This section covers how to configure the Apache Stanbol Enhancer to 
recognize and link entities of your custom vocabulary with processed 
documents.</p>
 <p>Generally there are two possible ways you can use to recognize entities of 
your vocabulary:</p>
 <ol>
@@ -193,43 +227,48 @@ org.apache.stanbol.entityhub.indexing.ge
 <p>In case named entity linking is used the linking with the custom vocabulary 
is done by the <a 
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity 
Tagging Engine</a>.
 For the configuration of this engine you need to provide the following 
parameters</p>
 <ol>
-<li>The "name" of the enhancement engine. It is recommended to use 
"{name}Linking" - where {name} is the name of your vocabulary as used in part 
A. of this scenario.</li>
+<li>The "name" of the enhancement engine. It is recommended to use 
"{name}Linking" - where {name} is the name of the Entityhub Site (ReferencedSite 
or ManagedSite).</li>
 <li>The name of the referenced site holding your vocabulary. Here you have to 
configure the {name}.</li>
 <li>Enable/disable persons, organizations and places and, if enabled, configure 
the <code>rdf:type</code> used by your vocabulary for those types. If you do not 
want to restrict the type, you can also leave the type field empty.</li>
 <li>Define the property used to match against the named entities detected by 
the used NER engine(s).</li>
 </ol>
 <p>For more detailed information please see the documentation of the <a 
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity 
Tagging Engine</a>.</p>
-<p>Note that for using named entity linking you also need to ensure that an 
enhancement engine that provides NER is available in the <a 
href="components/enhancer/chains">enhancement chain</a>. By default Apache 
Stanbol includes three different engines that provide this feature: (1) <a 
href="components/enhancer/engines/namedentityextractionengine.html">Named 
Entity Extraction Enhancement Engine</a> based on <a 
href="http://opennlp.apache.org">OpenNLP</a>, (2) CELI NER engine based on the 
<a href="http://Linguagrid.org">linguagrid.org</a> service and (3) <a 
href="components/enhancer/engines/opencalaisengine.html">OpenCalais Enhancement 
Engine</a> based on <a href="http://opencalais.com">OpenCalais</a>. Note that 
the latter two options require sending your content to the respective 
services, which are not part of your local Apache Stanbol instance.</p>
-<p>A typical <a href="components/enhancer/chains">enhancement chain</a> for 
named entity linking with your custom vocabulary might look like</p>
+<p>Note that for using named entity linking you also need to ensure that an 
enhancement engine that provides NER (Named Entity Recognition) is available 
in the <a href="components/enhancer/chains">enhancement chain</a>. See the <a 
href="components/enhancer/nlp/#stanbol-enhancer-nlp-support">Stanbol NLP 
processing Language Support</a> section for detailed information on Languages 
with NER support.</p>
+<p>The following example shows an <a 
href="components/enhancer/chains">enhancement chain</a> for named entity 
linking based on OpenNLP and CELI as NLP processing modules</p>
 <ul>
-<li>"langid" - <a 
href="components/enhancer/engines/langidengine.html">Language Identification 
Engine</a> - to detect the language of the parsed content - a pre-requirement 
of all NER engines</li>
-<li>"ner" - for NER support in English, Spanish and Dutch via the <a 
href="components/enhancer/engines/namedentityextractionengine.html">Named 
Entity Extraction Enhancement Engine</a></li>
+<li>"langdetect" - <a 
href="components/enhancer/engines/langdetectengine">Language Detection 
Engine</a> - to detect the language of the parsed content - a pre-requirement 
of all NER engines</li>
+<li>"opennlp-ner" - for NER support in English, Spanish and Dutch via the <a 
href="components/enhancer/engines/namedentityextractionengine.html">Named 
Entity Extraction Enhancement Engine</a></li>
 <li>"celiNer" - for NER support in French and Italian via the CELI NER 
engine</li>
 <li>"{name}Linking" - the <a 
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity 
Tagging Engine</a> for your vocabulary as configured above.</li>
 </ul>
 <p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted 
chain</a> and the <a href="components/enhancer/chains/listchain.html">list 
chain</a> can be used for the configuration of such a chain.</p>
-<h3 id="configure-keyword-linking">Configure Keyword Linking</h3>
-<p>In case you want to use keyword linking to extract and link entities of 
your vocabulary you will need to configure the <a 
href="components/enhancer/engines/keywordlinkingengine.html">Keyword Linking 
Engine</a> accordingly.</p>
-<p>Here are the most important configuration options provided by the Keyword 
Linking Engine when configured via the <a 
href="http://localhost:8080/system/console/configMgr">configuration tab</a> of 
the Apache Felix web console - http://{host}:{port}/system/console/configMgr. 
For the full list and detailed information please see the <a 
href="components/enhancer/engines/keywordlinkingengine.html">documentation</a>.</p>
-<ol>
-<li>The "Name" of the enhancement engine. It is recommended to use 
"{name}Keyword" - where {name} is the name of your vocabulary as used in part 
A. of this scenario</li>
-<li>The name of the "Referenced Site" holding your vocabulary. Here you have 
to configure the {name}</li>
-<li>The "Label Field" is the URI of the property in your vocabulary providing 
the labels used for matching. You can only use a single field. If you want to 
use values of several fields you have two options: (1) to adapt your indexing 
configuration to copy the values of those fields to a single one (e.g. the 
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in 
the default configuration of the Entityhub indexing tool (see 
{indexing-working-dir}/indexing/config/mappings.txt) (2) to configure multiple 
Keyword Linking Engine(s) - one for each label field. Option (1) is preferable 
as long as you do not need to use different configurations for the different 
labels.</li>
+<h3 id="configuring-named-entity-linking_1">Configuring Named Entity 
Linking</h3>
+<p>First it is important to note the difference between <em>Named Entity 
Linking</em> and <em>Entity Linking</em>. While <em>Named Entity Linking</em> 
only considers <em>Named Entities</em> detected by NER (Named Entity 
Recognition), <em>Entity Linking</em> works on words (tokens). Because of 
that it has much lower NLP requirements and can even operate for languages 
where only word tokenization is supported. However, extraction results and 
performance improve greatly with POS (Part of Speech) tagging support. 
Chunking (Noun Phrase detection), NER and Lemmatization results can also be 
consumed by Entity Linking to further improve extraction results. For details 
see the documentation of the <a 
href="components/enhancer/engines/entitylinking#linking-process">Entity Linking 
Process</a>.</p>
+<p>The second big difference is that <em>Named Entity Linking</em> can only 
support Entity types supported by the NER models (Persons, Organizations and 
Places). <em>Entity Linking</em> does not have this restriction. This advantage 
comes with the disadvantage that Entity lookups in the Controlled 
Vocabulary are based only on label similarities. <em>Named Entity Linking</em> 
also uses the type information provided by NER.</p>
+<p>To use <em>Entity Linking</em> with a custom Vocabulary, users need to 
configure an instance of the <a 
href="components/enhancer/engines/entityhublinking">Entityhub Linking 
Engine</a>. While this Engine provides more than twenty configuration 
parameters, the following list gives an overview of the most important ones. 
For detailed information please see the documentation of the Engine.</p>
+<ol>
+<li>The "Name" of the enhancement engine. It is recommended to use something 
like "{name}Extraction" - where {name} is the name of the Entityhub Site</li>
+<li>The name of the "Managed- / Referenced Site" holding your vocabulary. Here 
you have to configure the {name}</li>
+<li>The "Label Field" is the URI of the property in your vocabulary providing 
the labels used for matching. You can only use a single field. If you want to 
use values of several fields you have two options: (1) adapt your indexing 
configuration to copy the values of those fields to a single one (e.g. the 
values of "skos:prefLabel" and "skos:altLabel" are copied to "rdfs:label" in 
the default configuration of the Entityhub indexing tool, see 
{indexing-working-dir}/indexing/config/mappings.txt), or (2) configure multiple 
EntityhubLinkingEngines - one for each label field. Option (1) is preferable as 
long as you do not need to use different configurations for the different 
labels.</li>
+<li>"Link ProperNouns only": If the custom Vocabulary contains Proper 
Nouns (Named Entities) then this parameter should be activated. This option 
causes the Entity Linking process to skip queries for common nouns, 
reducing the number of queries against the controlled vocabulary by 
~70%. However, this is not feasible if the vocabulary contains Entities that 
are common nouns in the language.</li>
 <li>The "Type Mappings" might be interesting for you if your vocabulary 
contains custom types as those mappings can be used to map 'rdf:type's of 
entities in your vocabulary to 'dc:type's used for 'fise:TextAnnotation's - 
created by the Apache Stanbol Enhancer to annotate occurrences of extracted 
entities in the parsed text. See the <a 
href="components/enhancer/engines/keywordlinkingengine.html#type-mappings-syntax">type
 mapping syntax</a> and the <a 
href="enhancementusage.html#entity-tagging-with-disambiguation-support">usage 
scenario for the Apache Stanbol Enhancement Structure</a> for details.</li>
 </ol>
-<p>A typical <a href="components/enhancer/chains">enhancement chain</a> for 
named entity linking with your vocabulary might look like</p>
+<p>The following example shows an <a 
href="components/enhancer/chains">enhancement chain</a> using OpenNLP for 
NLP:</p>
 <ul>
-<li>"langid" - <a 
href="components/enhancer/engines/langidengine.html">Language Identification 
Engine</a> - to detect the language of the parsed content - a pre-requirement 
of the Keyword Linking Engine.</li>
-<li>"{name}Keyword - the <a 
href="components/enhancer/engines/keywordlinkingengine.html">Keyword Linking 
Engine</a> for your vocabulary as configured above.</li>
+<li>"langdetect" - <a 
href="components/enhancer/engines/langdetectengine">Language Detection 
Engine</a> - to detect the language of the parsed content - a pre-requirement 
of all NER engines</li>
+<li>opennlp-sentence - <a 
href="components/enhancer/engines/opennlpsentence">Sentence detection with 
OpenNLP</a></li>
+<li>opennlp-token - <a 
href="components/enhancer/engines/opennlptokenizer">OpenNLP based Word 
tokenization</a>. Works for all languages where white spaces can be used to 
tokenize.</li>
+<li>opennlp-pos - <a href="components/enhancer/engines/opennlppos">OpenNLP 
Part of Speech tagging</a></li>
+<li>opennlp-chunker - The <a 
href="components/enhancer/engines/opennlpchunker">OpenNLP chunker</a> provides 
Noun Phrases</li>
+<li>"{name}Extraction" - the <a 
href="components/enhancer/engines/entityhublinking">Entityhub Linking 
Engine</a> configured for the custom vocabulary.</li>
 </ul>
 <p>Both the <a href="components/enhancer/chains/weightedchain.html">weighted 
chain</a> and the <a href="components/enhancer/chains/listchain.html">list 
chain</a> can be used for the configuration of such a chain.</p>
+<p>The documentation of the Stanbol NLP processing module provides <a 
href="components/enhancer/nlp/#stanbol-enhancer-nlp-support">detailed 
information</a> about integrated NLP frameworks and supported languages.</p>
 <h3 id="how-to-use-enhancement-chains">How to use enhancement chains</h3>
-<p>In the default configuration the Apache Stanbol Enhancer provides two 
enhancement chains:</p>
-<p>1) a "default" chain that includes all currently active <a 
href="components/enhancer/engines">enhancement engines</a> and 
-2) the "language" chain that is intended to be used to detect the language of 
parsed content.</p>
-<p>As soon as Apache Stanbol users start to add own vocabularies to the Apache 
Stanbol Entityhub and configure <a 
href="components/enhancer/engines/namedentitytaggingengine.html">Named Entity 
Tagging Engine</a> or <a 
href="components/enhancer/engines/keywordlinkingengine.html">Keyword Linking 
Engine</a>, the default chain, which includes all active engines, may become 
unusable. Most likely users want to deactivate the "default" chain and 
configure their own. This section provides more information on how to do 
that.</p>
-<p><strong>Deactivate the chain of all active enhancement engines</strong></p>
-<p>Users that add additional enhancement engines might need to deactivate the 
enhancement chain that includes all active engines. This can be done in the 
configuration tab of the Apache Felix web console - <a 
href="http://localhost:8080/system/console/configMgr">http://{stabol-host}/system/console/configMgr</a>.
 Open the configuration dialog of the "Apache Stanbol Enhancer Chain: Default 
Chain" component and deactivate it.</p>
+<p>In the default configuration the Apache Stanbol Enhancer provides several 
enhancement chains including:</p>
+<p>1) a "default" chain providing <em>Named Entity Linking</em> based on 
DBpedia and <em>Entity Linking</em> based on the Entityhub
+2) the "language" chain that is intended to be used to detect the language of 
parsed content.
+3) a "dbpedia-proper-noun-linking" chain showing <em>Named Entity Linking</em> 
based on DBpedia</p>
 <p><strong>Change the enhancement chain bound to "/enhancer"</strong></p>
 <p>The enhancement chain bound to </p>
 <div class="codehilite"><pre><span class="n">http:</span><span 
class="sr">//</span><span class="p">{</span><span class="n">stanbol</span><span 
class="o">-</span><span class="n">host</span><span class="p">}</span><span 
class="o">/</span><span class="n">enhancer</span>

Added: websites/staging/stanbol/trunk/content/docs/trunk/enhancementworkflow.png
==============================================================================
Binary file - no diff available.

Propchange: 
websites/staging/stanbol/trunk/content/docs/trunk/enhancementworkflow.png
------------------------------------------------------------------------------
    svn:mime-type = image/png
