Author: buildbot
Date: Wed Jan 30 14:50:14 2013
New Revision: 848597
Log:
Staging update by buildbot for stanbol
Modified:
websites/staging/stanbol/trunk/content/ (props changed)
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Jan 30 14:50:14 2013
@@ -1 +1 @@
-1440412
+1440437
Modified:
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
==============================================================================
---
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
(original)
+++
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
Wed Jan 30 14:50:14 2013
@@ -87,71 +87,68 @@
<ul> <li><a href="/">Home</a></li> <li class="item"><a
href="/docs/">Docs</a></li> <li class="item"><a
href="/docs/trunk/">Trunk</a></li> <li class="item"><a
href="/docs/trunk/components/">Components</a></li> <li class="item"><a
href="/docs/trunk/components/enhancer/">Enhancer</a></li> <li class="item"><a
href="/docs/trunk/components/enhancer/nlp/">Nlp</a></li> </ul>
</div>
<h1 class="title">Stanbol Enhancer Natural Language Processing Support
</h1>
- <p><strong>NOTE:</strong> The NLP processing module for the Stanbol
Enhancer was introduced by <a
href="https://issues.apache.org/jira/browse/STANBOL-733">STANBOL-733</a> and is
only available to Stanbol Enhancer starting from version <code>0.10.0</code></p>
+ <p><strong>NOTE:</strong> The NLP processing module for the Apache Stanbol
Enhancer was introduced in <a
href="https://issues.apache.org/jira/browse/STANBOL-733">STANBOL-733</a> and is
only available in Apache Stanbol Enhancer versions starting from
<code>0.10.0</code>.</p>
<h2 id="overview">Overview:</h2>
<p>This section covers the following topics:</p>
<ul>
<li><a href="#stanbol-natural-language-processing">Stanbol Natural Language
Processing</a>: Short introduction to NLP techniques used by the Stanbol
Enhancer</li>
-<li>The <a href="#nlp-processing-api">NLP processing API</a>: Information
about the Java API of the NLP processing Framework including information on<ul>
-<li>How to implement an <a href="nlpengine">NLP EnhancementEngine</a> and</li>
-<li>How to integrate NLP frameworks as a <a
href="restfulnlpanalysisservice">RESTful NLP Analyses Service</a> and <a
href="restfullangidentservice">RESTful Language Identification Service</a></li>
+<li>The <a href="#nlp-processing-api">NLP processing API</a>: Information
about the Java API of the NLP processing framework including information on<ul>
+<li>how to implement an <a href="nlpengine">NLP Enhancement Engine</a> and</li>
+<li>how to integrate third party NLP frameworks as a <a
href="restfulnlpanalysisservice">RESTful NLP Analyses Service</a> and <a
href="restfullangidentservice">RESTful Language Identification Service</a></li>
</ul>
</li>
-<li>Finally a list supported NLP frameworks and languages<ul>
-<li><a href="#integrated-nlp-frameworks">Integrated NLP processing
Frameworks</a> and </li>
-<li><a href="#supported-languages">Supported Languages</a></li>
+<li>Lists of supported NLP frameworks and languages<ul>
+<li><a href="#integrated-nlp-frameworks">Integrated NLP processing
frameworks</a> and </li>
+<li><a href="#supported-languages">Supported languages</a></li>
</ul>
</li>
</ul>
-<p>Additional Information can be found in</p>
-<ul>
-<li>Usage Scenario <a href="/docs/trunk/multilingual.html">Working with
Multiple Languages</a></li>
-</ul>
+<p>Additional information can be found in the usage scenario about <a
href="/docs/trunk/multilingual.html">working with multiple languages</a>.</p>
<h2 id="stanbol-natural-language-processing">Stanbol Natural Language
Processing</h2>
-<p>The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques</p>
+<p>The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques:</p>
<ul>
-<li><strong>Language Detection</strong>: As all the following NLP processing
techniques are highly specific to the language of the text it is very important
to correctly detect the language of the analyzed text. Any Stanbol Enhancer
chain that uses NLP requires the</li>
-<li><strong>Sentence Detection</strong>: The detection / extraction of
<em>Sentences</em> from the analyzed text. Sentences are typically used as
'processing units' by Apache Stanbol. If no sentence detection is available for
a language Stanbol will typically process the text as if it would be a single
Sentence.</li>
-<li><strong>Word Tokenization</strong>: The detection of single <em>Words</em>
is required by the Stanbol Enhancer to process text. While this is trivial for
most languages it is a rather complex task for some (e.g. Chinese, Japanese,
Korean). If not otherwise configured Apache Stanbol will use
<em>Whitespaces</em> to tokenize words.</li>
-<li><strong>Part of Speech (POS) Tagging</strong>: This refers to the
annotation of <em>Words</em> with their <em>Lexical Category</em>. For Entity
extraction / linking especially <em>Words</em> with the category <em>Noun</em>
and the sub-category <em>Proper Noun</em> are of special interest. For POS
tagging Stanbol supports both string tags and ontological concepts as defined
by the <a href="http://olia.nlp2rdf.org/">OLIA</a> ontology.</li>
-<li><strong>Chunking</strong>: This refers to the ability to detect groups of
words that belong together. Often tools to also assign a type to such groups.
E.g. a Noun Phrase Detection refers to the extraction of chunks around a Noun.
This functionality helps in the detection of multi-word Entities (e.g. the
White House), but it is also interesting for users that want to collect
information about adjectives used in combination with nouns (e.g. nice holiday,
beautiful city, …)</li>
-<li><strong>Named Entity Recognition_ (NER)</strong>: The detection of
<em>Entities</em> in an analyzed text. Such entities can consist of multiple
words and typically do have a type assigned. Typical detectable types include
<em>Poerson</em>, <em>Organization</em> and <em>Places</em> however most
frameworks allow users to train models for additional domain specific
types.</li>
-<li><strong>Lemmatization</strong>: Often words in a text are not in the form
how they appear in controlled vocabularies (incl. dictionaries). This might
result in Situations where Entities are not correctly recognized in the text,
because the word of the mention does not match the label in the vocabulary.
Lemmatization help with that as it provides the base form - the <em>Lemma</em>
- for the word as mentioned in the Text.</li>
+<li><strong>Language Detection</strong>: As all the following NLP processing
techniques are highly specific to the language of the text it is very important
to correctly detect the language of the analyzed text.</li>
+<li><strong>Sentence Detection</strong>: Any Stanbol Enhancer chain that uses
NLP requires the detection and extraction of <em>sentences</em> from the
analyzed text. Sentences are typically used as 'processing units' in Stanbol.
If no sentence detection is available for a language, Stanbol will typically
process the text as if it would be a single sentence.</li>
+<li><strong>Word Tokenization</strong>: The detection of single <em>words</em>
is required by the Stanbol Enhancer to process text. While this is trivial for
most languages it is a rather complex task for some eastern languages, e.g.
Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use
<em>whitespaces</em> to tokenize words.</li>
+<li><strong>Part of Speech (POS) Tagging</strong>: This refers to the
annotation of <em>words</em> with their <em>lexical category</em>. For entity
extraction and linking <em>words</em> with the category <em>noun</em> and the
sub-category <em>proper noun</em> are of special interest. For POS tagging
Stanbol supports both string tags and ontological concepts as defined by the <a
href="http://olia.nlp2rdf.org/">OLIA</a> ontology.</li>
+<li><strong>Chunking</strong>: This refers to the ability to detect groups of
words that belong together. Often tools assign a type to such groups. For
example, a noun phrase detection refers to the extraction of chunks around a
noun. This functionality helps in the detection of multi-word entities (e.g.
the White House), but it is also interesting for users that want to collect
information about adjectives used in combination with nouns (e.g. nice holiday,
beautiful city, ...)</li>
+<li><strong>Named Entity Recognition (NER)</strong>: The detection of
<em>entities</em> in an analyzed text. Such entities can consist of multiple
words and typically do have an assigned type. Typical detectable types include
<em>persons</em>, <em>organizations</em>, and <em>places</em>. However, most
frameworks allow users to train models for additional domain specific
types.</li>
+<li><strong>Lemmatization</strong>: Often words in a text are not in a form
they would appear in controlled vocabularies (incl. dictionaries). This might
result in situations where entities are not correctly recognized in the text,
because the found word does not match the label in the vocabulary.
Lemmatization helps with that as it provides the base form, known as the
<em>lemma</em>, for a word.</li>
</ul>
<p>Based on those techniques Stanbol supports two text enhancement processes
described in the following two sub sections.</p>
<h3 id="named-entity-linking">Named Entity Linking</h3>
-<p>This chain is based on <em>Named Entity Recognition</em> and than linking
recognized entities with controlled vocabularies. A typical <em>Enhancement
Chain</em> contains the following type of Engines:</p>
+<p>This chain is based on <em>named entity recognition</em> (NER) followed by
linking the recognized entities with controlled vocabularies. A typical
enhancement chain contains the following types of engines:</p>
<ul>
-<li><em>Language Detection</em> (required): The language of the text is needed
to select the correct NLP components for the following processing steps</li>
-<li><em>Sentence Detection</em> (optional): If sentences are detected, than
processing of the later steps is done sentence after sentence what definitely
improves performance and might also improve results.</li>
-<li><em>Word Tokenization</em> (required): The detection of Named Entities is
based on processing Tokens.</li>
-<li><em>Named Entity Recognition</em> (required): The detection of Entities
mentioned in the Text</li>
-<li><em>Named Entity Linking</em> (optional): This steps links Entities
recognized in the Text with Entities defined in a <em>Controlled
Vocabulary</em>.</li>
+<li><em>Language Detection</em> (required): The language of the text is needed
to select the correct NLP components for the following processing steps.</li>
+<li><em>Sentence Detection</em> (optional): If sentences are detected, the
processing of the later steps is done sentence by sentence instead of the whole
text at once. This improves performance and might also improve results.</li>
+<li><em>Word Tokenization</em> (required): The detection of named entities is
based on processed tokens.</li>
+<li><em>Named Entity Recognition</em> (required): The recognition of entities
mentioned in the text.</li>
+<li><em>Named Entity Linking</em> (optional): Links entities recognized in the
text with entities defined in a controlled vocabulary.</li>
</ul>
<h3 id="entity-linking">Entity Linking</h3>
-<p>This chain is based on <em>Part of Speech</em>, <em>Chunking</em> and
<em>Lematization</em> results. It uses those results to lookup <em>Words</em>
in the configured <em>Controlled Vocabulary</em>. A typical Enhacement Chain
contains the following type of Engines:</p>
+<p>This chain is based on <em>part of speech</em>, <em>chunking</em> and
<em>lemmatization</em> analysis. It uses those results to look up words in a
configured controlled vocabulary. A typical enhancement chain contains the
following types of engines:</p>
<ul>
-<li><em>Language Detection</em> (required): The language of the text is needed
to select the correct NLP components for the following processing steps</li>
-<li><em>Sentence Detection</em> (optional): If sentences are detected, than
processing of the later steps is done sentence after sentence what definitely
improves performance and might also improve results.</li>
-<li><em>Word Tokenization</em> (required): The detection of Named Entities is
based on processing Tokens.</li>
-<li><em>Part of Speech</em> (optional): The POS tag of words is used to decide
if it should be linked with the Vocabulary or not. Linked <em>Lexical
Categories</em> are configurable but typically only <em>Proper Nouns</em> or
all <em>Nouns</em> are linked.</li>
-<li><em>Noun Phrase Detection</em> (optional): If <em>Chunking</em> of
<em>Nouns</em> is supported those information are used to improve linking of
multi-word Entities. E.g. two <em>Common Nouns</em> within the same <em>Noun
Phrase</em> are considered as <em>Proper Noun</em>.</li>
-<li><em>Lemmatization</em> (optional): If configured the Lemma can be used
instead of the <em>Word</em> as mentioned in the text for linking against the
controlled vocabulary.</li>
-<li><em>Entity Linking</em> (required): Entity Linking consumes all the above
NLP processing results and uses them to link <em>Entities</em> contained in the
configured <em>Controlled Vocabulary</em> with <em>Words</em> in the text. This
process requires (as a minimum) a correct <em>Tokenization</em> of the Text,
but is considerable improved by <em>POS</em> annotations of <em>Proper
Nouns</em> and <em>Nouns</em>. <em>Chunking</em> and <em>Lemmatization</em> may
further improve results, but their influence on the quality of results is not
as big as of the <em>POS</em> tagging.</li>
+<li><em>Language Detection</em> (required): The language of the text is needed
to select the correct NLP components for the following processing steps.</li>
+<li><em>Sentence Detection</em> (optional): If sentences are detected, the
processing of the later steps is done sentence by sentence instead of the whole
text at once. This improves performance and might also improve results.</li>
+<li><em>Word Tokenization</em> (required): The detection of named entities is
based on processed tokens.</li>
+<li><em>Part of Speech</em> (optional): The POS tag of words is used to decide
if it should be linked with the vocabulary or not. Linked <em>lexical
categories</em> are configurable but typically only <em>proper nouns</em> or
all <em>nouns</em> are linked.</li>
+<li><em>Noun Phrase Detection</em> (optional): If <em>chunking</em> of
<em>nouns</em> is supported, this information is used to improve linking of
multi-word entities. For example, two <em>common nouns</em> within the same
<em>noun phrase</em> are considered as a <em>proper noun</em>.</li>
+<li><em>Lemmatization</em> (optional): If configured the lemma can be used
instead of the word as mentioned in the text for linking against the controlled
vocabulary.</li>
+<li><em>Entity Linking</em> (required): Entity linking consumes all the above
NLP processing results and uses them to link entities contained in the
configured controlled vocabulary with words in the text. This process requires
(as a minimum) a correct <em>tokenization</em> of the text. It is considerably
improved by POS annotations of proper nouns and nouns. Chunking and
lemmatization may further improve results, but their influence on the quality
of results is not as large as that of POS tagging.</li>
</ul>
-<p>Additional information on how to configure the Apache Stanbol in
multilingual environments are given by the Usage Scenarios <a
href="/docs/trunk/multilingual.html">Working with Multiple Languages</a>.</p>
+<p>Additional information on how to configure Stanbol in multilingual
environments is given in the usage scenario on <a
href="/docs/trunk/multilingual.html">working with multiple languages</a>.</p>
<h2 id="nlp-processing-api">NLP processing API</h2>
-<p>The intension of the Stanbol NLP processing API was to efficiently handle
word level NLP processing annotations. Something that was not possible by using
the RDF <a href="../contentitem#metadata-of-the-contentitem">metadata of the
ContentItem</a>. Instead of RDF the NLP processing API defines a JAVA API that
consists of the following two main parts:</p>
+<p>The intention of the Stanbol NLP processing API is to efficiently handle
word level NLP processing annotations, something that was not possible using
the RDF <a href="../contentitem#metadata-of-the-contentitem">metadata of the
ContentItem</a>. Instead of RDF, the NLP processing API defines a Java API that
consists of the following two main parts:</p>
<ul>
-<li><strong><a href="analyzedtext">AnalysedText</a></strong>: A data structure
that represent parts of the analyzed text such as <em>Tokens</em>,
<em>Chunks</em>, <em>Sentences</em> and the <em>AnalysedText</em> itself. All
such <em>Spans</em> select an part of the text and are sorted by their natural
order in a <em>NavigateableMap</em>. The <em>AnalysedText</em> instance is
added to the <a href="../contentitem">ContentItem</a> as ContentPart and is
parsed therefore between <a href="../engines">Enhancement Engines</a>. Every
<em>Span</em> of the <em>AnalysedText</em> can be annotated with
<em>Annotations</em>.</li>
-<li><strong><a href="nlpannotations">NLP Annotations</a></strong>: The Stanbol
NLP processing module defines Ontology aligned annotation models for typical
NLP processing results such as Part of Speech tagging, Phrase detection, Named
Entity Recognition, full Morphological Analysis as well as Sentiment tags.
Those annotations can be used to annotate <em>Span</em> contained in the
<em>AnalysedText</em>.</li>
+<li><strong><a href="analyzedtext">Analysed Text</a></strong>: A data
structure that represents parts of the analyzed text such as <em>tokens</em>,
<em>chunks</em>, <em>sentences</em> and the analysed text itself. All such
<em>spans</em> represent parts of the text and are sorted by their natural
order in a <code>NavigableMap</code>. The <code>AnalysedText</code> instance
is added to the <a href="../contentitem"><code>ContentItem</code></a> as a
<code>ContentPart</code> and is therefore passed between <a
href="../engines">enhancement engines</a>. Every span of the
<code>AnalysedText</code> can be annotated with <code>Annotations</code>.</li>
+<li><strong><a href="nlpannotations">NLP Annotations</a></strong>: The Stanbol
NLP processing module defines ontology aligned annotation models for typical
NLP processing results such as part of speech tagging, phrase detection, named
entity recognition, full morphological analysis, and sentiment tags. Those
annotations can be used to annotate <code>Span</code> contained in the
<code>AnalysedText</code>.</li>
</ul>
<p>The NLP processing module also provides a default <a
href="inmemoryanalyzedtextimpl">in-memory</a> implementation of all defined
interfaces. This implementation is used as default by the Stanbol Enhancer.</p>
-<p>Finally the NLP processing module also provides:</p>
+<p>Additionally, the NLP processing module provides:</p>
<ul>
-<li>Utilities for <a href="nlpengine">implementing NLP processing
EnhancementEngines</a> and supports the</li>
-<li>JSON serialization and parsing support for AnalysedText including NLP
Annotations. Together with the <a href="../engines/restfulnlpanalysis">RESTful
NLP Analysis Engine</a> this can be used to <a
href="restfulnlpanalysisservice">Integrate NLP Frameworks as RESTful
Services</a></li>
-<li>RESTful service definition for a <a
href="restfullangidentservice">language identification service</a> as well as
the <a href="../engines/restfullangident">RESTful Language Identification
Engine</a>. This allows to integrate language identification features of an NLP
framework in a similar way as the NLP Analyses described above (see <a
href="https://issues.apache.org/jira/browse/STANBOL-894">STANBOL-894</a> for
the Service specification)</li>
+<li>Utilities for <a href="nlpengine">implementing NLP processing enhancement
engines</a>.</li>
+<li>JSON serialization and parsing support for analysed text including NLP
annotations. Together with the <a href="../engines/restfulnlpanalysis">RESTful
NLP analysis engine</a> this can be used to <a
href="restfulnlpanalysisservice">integrate NLP frameworks as RESTful
services</a>.</li>
+<li>RESTful service definition for a <a
href="restfullangidentservice">language identification service</a> as well as
the <a href="../engines/restfullangident">RESTful language identification
engine</a>. This allows integrating language identification features of an NLP
framework in a similar way as the NLP analysis described above (see <a
href="https://issues.apache.org/jira/browse/STANBOL-894">STANBOL-894</a> for
the service specification).</li>
</ul>
<h2 id="stanbol-enhancer-nlp-support">Stanbol Enhancer NLP Support</h2>
<p>This section provides an overview about the currently integrated NLP
frameworks and their supported languages.</p>