Author: buildbot
Date: Wed Jan 30 14:50:14 2013
New Revision: 848597

Log:
Staging update by buildbot for stanbol

Modified:
    websites/staging/stanbol/trunk/content/   (props changed)
    
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html

Propchange: websites/staging/stanbol/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Jan 30 14:50:14 2013
@@ -1 +1 @@
-1440412
+1440437

Modified: 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
==============================================================================
--- 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
 (original)
+++ 
websites/staging/stanbol/trunk/content/docs/trunk/components/enhancer/nlp/index.html
 Wed Jan 30 14:50:14 2013
@@ -87,71 +87,68 @@
       <ul> <li><a href="/">Home</a></li> <li class="item"><a 
href="/docs/">Docs</a></li> <li class="item"><a 
href="/docs/trunk/">Trunk</a></li> <li class="item"><a 
href="/docs/trunk/components/">Components</a></li> <li class="item"><a 
href="/docs/trunk/components/enhancer/">Enhancer</a></li> <li class="item"><a 
href="/docs/trunk/components/enhancer/nlp/">Nlp</a></li> </ul>
     </div>
     <h1 class="title">Stanbol Enhancer Natural Language Processing Support 
</h1>
-    <p><strong>NOTE:</strong> The NLP processing module for the Stanbol 
Enhancer was introduced by <a 
href="https://issues.apache.org/jira/browse/STANBOL-733">STANBOL-733</a> and is 
only available to Stanbol Enhancer starting from version <code>0.10.0</code></p>
+    <p><strong>NOTE:</strong> The NLP processing module for the Apache Stanbol 
Enhancer was introduced in <a 
href="https://issues.apache.org/jira/browse/STANBOL-733">STANBOL-733</a> and is 
only available in Apache Stanbol Enhancer versions starting from 
<code>0.10.0</code>.</p>
 <h2 id="overview">Overview:</h2>
 <p>This section covers the following topics:</p>
 <ul>
 <li><a href="#stanbol-natural-language-processing">Stanbol Natural Language 
Processing</a>: Short introduction to NLP techniques used by the Stanbol 
Enahncer</li>
-<li>The <a href="#nlp-processing-api">NLP processing API</a>: Information 
about the Java API of the NLP processing Framework including information on<ul>
-<li>How to implement an <a href="nlpengine">NLP EnhancementEngine</a> and</li>
-<li>How to integrate NLP frameworks as a <a 
href="restfulnlpanalysisservice">RESTful NLP Analyses Service</a> and <a 
href="restfullangidentservice">RESTful Language Identification Service</a></li>
+<li>The <a href="#nlp-processing-api">NLP processing API</a>: Information 
about the Java API of the NLP processing framework including information on<ul>
+<li>how to implement an <a href="nlpengine">NLP Enhancement Engine</a> and</li>
+<li>how to integrate third party NLP frameworks as a <a 
href="restfulnlpanalysisservice">RESTful NLP Analyses Service</a> and <a 
href="restfullangidentservice">RESTful Language Identification Service</a></li>
 </ul>
 </li>
-<li>Finally a list supported NLP frameworks and languages<ul>
-<li><a href="#integrated-nlp-frameworks">Integrated NLP processing 
Frameworks</a> and </li>
-<li><a href="#supported-languages">Supported Languages</a></li>
+<li>Lists of supported NLP frameworks and languages<ul>
+<li><a href="#integrated-nlp-frameworks">Integrated NLP processing 
frameworks</a> and </li>
+<li><a href="#supported-languages">Supported languages</a></li>
 </ul>
 </li>
 </ul>
-<p>Additional Information can be found in</p>
-<ul>
-<li>Usage Scenario <a href="/docs/trunk/multilingual.html">Working with 
Multiple Languages</a></li>
-</ul>
+<p>Additional Information can be found in the usage scenario about <a 
href="/docs/trunk/multilingual.html">working with multiple languages</a></p>
 <h2 id="stanbol-natural-language-processing">Stanbol Natural Language 
Processing</h2>
-<p>The natural language processing module of the Stanbol Enhancer supports the 
usage of the following NLP processing techniques</p>
+<p>The natural language processing module of the Stanbol Enhancer supports the 
usage of the following NLP processing techniques:</p>
 <ul>
-<li><strong>Language Detection</strong>: As all the following NLP processing 
techniques are highly specific to the language of the text it is very important 
to correctly detect the language of the analyzed text. Any Stanbol Enhancer 
chain that uses NLP requires the</li>
-<li><strong>Sentence Detection</strong>: The detection / extraction of 
<em>Sentences</em> from the analyzed text. Sentences are typically used as 
'processing units' by Apache Stanbol. If no sentence detection is available for 
a language Stanbol will typically process the text as if it would be a single 
Sentence.</li>
-<li><strong>Word Tokenization</strong>: The detection of single <em>Words</em> 
is required by the Stanbol Enhancer to process text. While this is trivial for 
most languages it is a rather complex task for some (e.g. Chinese, Japanese, 
Korean). If not otherwise configured Apache Stanbol will use 
<em>Whitespaces</em> to tokenize words.</li>
-<li><strong>Part of Speech (POS) Tagging</strong>: This refers to the 
annotation of <em>Words</em> with their <em>Lexical Category</em>. For Entity 
extraction / linking especially <em>Words</em> with the category <em>Noun</em> 
and the sub-category <em>Proper Noun</em> are of special interest. For POS 
tagging Stanbol supports both string tags and ontological concepts as defined 
by the <a href="http://olia.nlp2rdf.org/">OLIA</a> ontology.</li>
-<li><strong>Chunking</strong>: This refers to the ability to detect groups of 
words that belong together.  Often tools to also assign a type to such groups. 
E.g. a Noun Phrase Detection refers to the extraction of chunks around a Noun. 
This functionality helps in the detection of multi-word Entities (e.g. the 
White House), but it is also interesting for users that want to collect 
information about adjectives used in combination with nouns (e.g. nice holiday, 
beautiful city, …)</li>
-<li><strong>Named Entity Recognition_ (NER)</strong>: The detection of 
<em>Entities</em> in an analyzed text. Such entities can consist of multiple 
words and typically do have a type assigned. Typical  detectable types include 
<em>Poerson</em>, <em>Organization</em> and <em>Places</em> however most 
frameworks allow users to train models for additional domain specific 
types.</li>
-<li><strong>Lemmatization</strong>: Often words in a text are not in the form 
how they appear in controlled vocabularies (incl. dictionaries). This might 
result in Situations where Entities are not correctly recognized in the text, 
because the word of the mention does not match the label in the vocabulary. 
Lemmatization help with that as it provides the base form - the <em>Lemma</em> 
- for the word as mentioned in the Text.</li>
+<li><strong>Language Detection</strong>: As all the following NLP processing 
techniques are highly specific to the language of the text it is very important 
to correctly detect the language of the analyzed text.</li>
+<li><strong>Sentence Detection</strong>: Any Stanbol Enhancer chain that uses 
NLP requires the detection and extraction of <em>sentences</em> from the 
analyzed text. Sentences are typically used as 'processing units' in Stanbol. 
If no sentence detection is available for a language, Stanbol will typically 
process the text as if it would be a single sentence.</li>
+<li><strong>Word Tokenization</strong>: The detection of single <em>words</em> 
is required by the Stanbol Enhancer to process text. While this is trivial for 
most languages it is a rather complex task for some eastern languages, e.g. 
Chinese, Japanese, Korean. If not otherwise configured, Stanbol will use 
<em>whitespaces</em> to tokenize words.</li>
+<li><strong>Part of Speech (POS) Tagging</strong>: This refers to the 
annotation of <em>words</em> with their <em>lexical category</em>. For entity 
extraction and linking <em>words</em> with the category <em>noun</em> and the 
sub-category <em>proper noun</em> are of special interest. For POS tagging 
Stanbol supports both string tags and ontological concepts as defined by the <a 
href="http://olia.nlp2rdf.org/">OLIA</a> ontology.</li>
+<li><strong>Chunking</strong>: This refers to the ability to detect groups of 
words that belong together.  Often tools assign a type to such groups. For 
example, a noun phrase detection refers to the extraction of chunks around a 
noun. This functionality helps in the detection of multi-word entities (e.g. 
the White House), but it is also interesting for users that want to collect 
information about adjectives used in combination with nouns (e.g. nice holiday, 
beautiful city, ...)</li>
+<li><strong>Named Entity Recognition_ (NER)</strong>: The detection of 
<em>entities</em> in an analyzed text. Such entities can consist of multiple 
words and typically do have an assigned type. Typical detectable types include 
<em>persons</em>, <em>organizations</em>, and <em>places</em>. However, most 
frameworks allow users to train models for additional domain specific 
types.</li>
+<li><strong>Lemmatization</strong>: Often words in a text are not in a form 
they would appear in controlled vocabularies (incl. dictionaries). This might 
result in Situations where entities are not correctly recognized in the text, 
because the found word does not match the label in the vocabulary. 
Lemmatization help with that as it provides the base form, known as the 
<em>lemma</em>, for a word.</li>
 </ul>
 <p>Based on those techniques Stanbol supports two text enhancement processes 
described in the following two sub sections.</p>
 <h3 id="named-entity-linking">Named Entity Linking</h3>
-<p>This chain is based on <em>Named Entity Recognition</em> and than linking 
recognized entities with controlled vocabularies. A typical <em>Enhancement 
Chain</em> contains the following type of Engines:</p>
+<p>This chain is based on <em>named entity recognition</em> (NER) by linking 
recognized entities with controlled vocabularies. A typical enhancement chain 
contains the following type of engines:</p>
 <ul>
-<li><em>Language Detection</em> (required): The language of the text is needed 
to select the correct NLP components for the following processing steps</li>
-<li><em>Sentence Detection</em> (optional): If sentences are detected, than 
processing of the later steps is done sentence after sentence what definitely 
improves performance and might also improve results.</li>
-<li><em>Word Tokenization</em> (required): The detection of Named Entities is 
based on processing Tokens.</li>
-<li><em>Named Entity Recognition</em> (required): The detection of Entities 
mentioned in the Text</li>
-<li><em>Named Entity Linking</em> (optional): This steps links Entities 
recognized in the Text with Entities defined in a <em>Controlled 
Vocabulary</em>.</li>
+<li><em>Language Detection</em> (required): The language of the text is needed 
to select the correct NLP components for the following processing steps.</li>
+<li><em>Sentence Detection</em> (optional): If sentences are detected, the 
processing of the later steps is done sentence by sentence instead of the whole 
text at once. This improves performance and might also improve results.</li>
+<li><em>Word Tokenization</em> (required): The detection of named entities is 
based on processed tokens.</li>
+<li><em>Named Entity Recognition</em> (required): The recognition of entities 
mentioned in the text.</li>
+<li><em>Named Entity Linking</em> (optional): Links entities recognized in the 
text with entities defined in a controlled vocabulary.</li>
 </ul>
 <h3 id="entity-linking">Entity Linking</h3>
-<p>This chain is based on <em>Part of Speech</em>, <em>Chunking</em> and 
<em>Lematization</em> results. It uses those results to lookup <em>Words</em> 
in the configured <em>Controlled Vocabulary</em>. A typical Enhacement Chain 
contains the following type of Engines:</p>
+<p>This chain is based on <em>part of speech</em>, <em>chunking</em> and 
<em>lematization</em> analysis. It uses those results to lookup words in a 
configured controlled vocabulary. A typical enhacement chain contains the 
following type of engines:</p>
 <ul>
-<li><em>Language Detection</em> (required): The language of the text is needed 
to select the correct NLP components for the following processing steps</li>
-<li><em>Sentence Detection</em> (optional): If sentences are detected, than 
processing of the later steps is done sentence after sentence what definitely 
improves performance and might also improve results.</li>
-<li><em>Word Tokenization</em> (required): The detection of Named Entities is 
based on processing Tokens.</li>
-<li><em>Part of Speech</em> (optional): The POS tag of words is used to decide 
if it should be linked with the Vocabulary or not. Linked <em>Lexical 
Categories</em> are configurable but typically only <em>Proper Nouns</em> or 
all <em>Nouns</em> are linked.</li>
-<li><em>Noun Phrase Detection</em> (optional): If <em>Chunking</em> of 
<em>Nouns</em> is supported those information are used to improve linking of 
multi-word Entities. E.g. two <em>Common Nouns</em> within the same <em>Noun 
Phrase</em> are considered as <em>Proper Noun</em>.</li>
-<li><em>Lemmatization</em> (optional): If configured the Lemma can be used 
instead of the <em>Word</em> as mentioned in the text for linking against the 
controlled vocabulary.</li>
-<li><em>Entity Linking</em> (required): Entity Linking consumes all the above 
NLP processing results and uses them to link <em>Entities</em> contained in the 
configured <em>Controlled Vocabulary</em> with <em>Words</em> in the text. This 
process requires (as a minimum) a correct <em>Tokenization</em> of the Text, 
but is considerable improved by <em>POS</em> annotations of <em>Proper 
Nouns</em> and <em>Nouns</em>. <em>Chunking</em> and <em>Lemmatization</em> may 
further improve results, but their influence on the quality of results is not 
as big as of the <em>POS</em> tagging.</li>
+<li><em>Language Detection</em> (required): The language of the text is needed 
to select the correct NLP components for the following processing steps.</li>
+<li><em>Sentence Detection</em> (optional): If sentences are detected, the 
processing of the later steps is done sentence by sentence instead of the whole 
text at once. This improves performance and might also improve results.</li>
+<li><em>Word Tokenization</em> (required): The detection of named entities is 
based on processed tokens.</li>
+<li><em>Part of Speech</em> (optional): The POS tag of words is used to decide 
if it should be linked with the vocabulary or not. Linked <em>lexical 
categories</em> are configurable but typically only <em>proper nouns</em> or 
all <em>nouns</em> are linked.</li>
+<li><em>Noun Phrase Detection</em> (optional): If <em>chunking</em> of 
<em>nouns</em> is supported those information are used to improve linking of 
multi-word entities. For example, two <em>common nouns</em> within the same 
<em>noun phrase</em> are considered as a <em>proper noun</em>.</li>
+<li><em>Lemmatization</em> (optional): If configured the lemma can be used 
instead of the word as mentioned in the text for linking against the controlled 
vocabulary.</li>
+<li><em>Entity Linking</em> (required): Entity linking consumes all the above 
NLP processing results and uses them to link entities contained in the 
configured controlled vocabulary with words in the text. This process requires 
(as a minimum) a correct <em>tokenization</em> of the text. It is considerable 
improved by POS annotations of proper nouns and nouns. Chunking and 
lemmatization may further improve results but their influence on the quality of 
results is not as big as of the POS tagging.</li>
 </ul>
-<p>Additional information on how to configure the Apache Stanbol in 
multilingual environments are given by the Usage Scenarios <a 
href="/docs/trunk/multilingual.html">Working with Multiple Languages</a>.</p>
+<p>Additional information on how to configure the Stanbol in multilingual 
environments are given by the usage scenarios on <a 
href="/docs/trunk/multilingual.html">working with multiple languages</a>.</p>
 <h2 id="nlp-processing-api">NLP processing API</h2>
-<p>The intension of the Stanbol NLP processing API was to efficiently handle 
word level NLP processing annotations. Something that was not possible by using 
the RDF <a href="../contentitem#metadata-of-the-contentitem">metadata of the 
ContentItem</a>. Instead of RDF the NLP processing API defines a JAVA API that 
consists of the following two main parts:</p>
+<p>The intention of the Stanbol NLP processing API is to efficiently handle 
word level NLP processing annotations. Something that was not possible by using 
the RDF <a href="../contentitem#metadata-of-the-contentitem">metadata of the 
contentItem</a>. Instead of RDF the NLP processing API defines a JAVA API that 
consists of the following two main parts:</p>
 <ul>
-<li><strong><a href="analyzedtext">AnalysedText</a></strong>: A data structure 
that represent parts of the analyzed text such as <em>Tokens</em>, 
<em>Chunks</em>, <em>Sentences</em> and the <em>AnalysedText</em> itself. All 
such <em>Spans</em> select an part of the text and are sorted by their natural 
order in a <em>NavigateableMap</em>. The <em>AnalysedText</em> instance is 
added to the <a href="../contentitem">ContentItem</a> as ContentPart and is 
parsed therefore between <a href="../engines">Enhancement Engines</a>. Every 
<em>Span</em> of the <em>AnalysedText</em> can be annotated with 
<em>Annotations</em>.</li>
-<li><strong><a href="nlpannotations">NLP Annotations</a></strong>: The Stanbol 
NLP processing module defines Ontology aligned annotation models for typical 
NLP processing results such as Part of Speech tagging, Phrase detection, Named 
Entity Recognition, full Morphological Analysis as well as Sentiment tags. 
Those annotations can be used to annotate <em>Span</em> contained in the 
<em>AnalysedText</em>.</li>
+<li><strong><a href="analyzedtext">Analysed Text</a></strong>: A data 
structure that represent parts of the analyzed text such as <em>tokens</em>, 
<em>chunks</em>, <em>sentences</em> and the analysed text itself. All such 
<em>spans</em> represent parts of the text and are sorted by their natural 
order in a <code>NavigateableMap</code>. The <code>AnalysedText</code> instance 
is added to the <a href="../contentitem"><code>ContentItem</code></a> as a 
<code>ContentPart</code> and is therefore parsed between <a 
href="../engines">enhancement engines</a>. Every span of the 
<code>AnalysedText</code> can be annotated with <code>Annotations</code>.</li>
+<li><strong><a href="nlpannotations">NLP Annotations</a></strong>: The Stanbol 
NLP processing module defines ontology aligned annotation models for typical 
NLP processing results such as part of speech tagging, phrase detection, named 
entity recognition, full morphological analysis, and sentiment tags. Those 
annotations can be used to annotate <code>Span</code> contained in the 
<code>AnalysedText</code>.</li>
 </ul>
 <p>The NLP processing module also provides a default <a 
href="inmemoryanalyzedtextimpl">in-memory</a> implementation of all defined 
interfaces. This implementation is used as default by the Stanbol Enhancer.</p>
-<p>Finally the NLP processing module also provides:</p>
+<p>Additionally, the NLP processing module provides:</p>
 <ul>
-<li>Utilities for <a href="nlpengine">implementing NLP processing 
EnhancementEngines</a> and supports the</li>
-<li>JSON serialization and parsing support for AnalysedText including NLP 
Annotations. Together with the <a href="../engines/restfulnlpanalysis">RESTful 
NLP Analysis Engine</a> this can be used to <a 
href="restfulnlpanalysisservice">Integrate NLP Frameworks as RESTful 
Services</a></li>
-<li>RESTful service definition for a <a 
href="restfullangidentservice">language identification service</a> as well as 
the <a href="../engines/restfullangident">RESTful Language Identification 
Engine</a>. This allows to integrate language identification features of an NLP 
framework in a similar way as the NLP Analyses described above (see <a 
href="https://issues.apache.org/jira/browse/STANBOL-894">STANBOL-894</a> for 
the Service specification)</li>
+<li>Utilities for <a href="nlpengine">implementing NLP processing enhancement 
engines</a>.</li>
+<li>JSON serialization and parsing support for analysed text including NLP 
annotations. Together with the <a href="../engines/restfulnlpanalysis">RESTful 
NLP analysis engine</a> this can be used to <a 
href="restfulnlpanalysisservice">integrate NLP frameworks as RESTful 
services</a>.</li>
+<li>RESTful service definition for a <a 
href="restfullangidentservice">language identification service</a> as well as 
the <a href="../engines/restfullangident">RESTful language identification 
engine</a>. This allows to integrate language identification features of an NLP 
framework in a similar way as the NLP analysis described above (see <a 
href="https://issues.apache.org/jira/browse/STANBOL-894">STANBOL-894</a> for 
the service specification).</li>
 </ul>
 <h2 id="stanbol-enhancer-nlp-support">Stanbol Enhancer NLP Support</h2>
 <p>This section provides an overview about the currently integrated NLP 
frameworks and their supported languages.</p>

