Author: fchrist
Date: Wed Jan 30 14:50:07 2013
New Revision: 1440437
URL: http://svn.apache.org/viewvc?rev=1440437&view=rev
Log:
Minor corrections
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext?rev=1440437&r1=1440436&r2=1440437&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
(original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
Wed Jan 30 14:50:07 2013
@@ -1,6 +1,6 @@
title: Stanbol Enhancer Natural Language Processing Support
-__NOTE:__ The NLP processing module for the Stanbol Enhancer was introduced by
[STANBOL-733](https://issues.apache.org/jira/browse/STANBOL-733) and is only
available to Stanbol Enhancer starting from version <code>0.10.0</code>
+__NOTE:__ The NLP processing module for the Apache Stanbol Enhancer was
introduced in [STANBOL-733](https://issues.apache.org/jira/browse/STANBOL-733)
and is only available in Apache Stanbol Enhancer versions starting from
<code>0.10.0</code>.
Overview:
---------
@@ -8,72 +8,70 @@ Overview:
This section covers the following topics:
* [Stanbol Natural Language Processing](#stanbol-natural-language-processing):
Short introduction to NLP techniques used by the Stanbol Enhancer
-* The [NLP processing API](#nlp-processing-api): Information about the Java
API of the NLP processing Framework including information on
- * How to implement an [NLP EnhancementEngine](nlpengine) and
- * How to integrate NLP frameworks as a [RESTful NLP Analyses
Service](restfulnlpanalysisservice) and [RESTful Language Identification
Service](restfullangidentservice)
-* Finally a list supported NLP frameworks and languages
- * [Integrated NLP processing Frameworks](#integrated-nlp-frameworks) and
- * [Supported Languages](#supported-languages)
+* The [NLP processing API](#nlp-processing-api): Information about the Java
API of the NLP processing framework including information on
+ * how to implement an [NLP Enhancement Engine](nlpengine) and
+ * how to integrate third-party NLP frameworks as a [RESTful NLP Analyses
Service](restfulnlpanalysisservice) and [RESTful Language Identification
Service](restfullangidentservice)
+* Lists of supported NLP frameworks and languages
+ * [Integrated NLP processing frameworks](#integrated-nlp-frameworks) and
+ * [Supported languages](#supported-languages)
-Additional Information can be found in
-
-* Usage Scenario [Working with Multiple
Languages](/docs/trunk/multilingual.html)
+Additional information can be found in the usage scenario about [working with
multiple languages](/docs/trunk/multilingual.html)
Stanbol Natural Language Processing
-----------------------------------
-The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques
+The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques:
-* __Language Detection__: As all the following NLP processing techniques are
highly specific to the language of the text it is very important to correctly
detect the language of the analyzed text. Any Stanbol Enhancer chain that uses
NLP requires the
-* __Sentence Detection__: The detection / extraction of _Sentences_ from the
analyzed text. Sentences are typically used as 'processing units' by Apache
Stanbol. If no sentence detection is available for a language Stanbol will
typically process the text as if it would be a single Sentence.
-* __Word Tokenization__: The detection of single _Words_ is required by the
Stanbol Enhancer to process text. While this is trivial for most languages it
is a rather complex task for some (e.g. Chinese, Japanese, Korean). If not
otherwise configured Apache Stanbol will use _Whitespaces_ to tokenize words.
-* __Part of Speech (POS) Tagging__: This refers to the annotation of _Words_
with their _Lexical Category_. For Entity extraction / linking especially
_Words_ with the category _Noun_ and the sub-category _Proper Noun_ are of
special interest. For POS tagging Stanbol supports both string tags and
ontological concepts as defined by the [OLIA](http://olia.nlp2rdf.org/)
ontology.
-* __Chunking__: This refers to the ability to detect groups of words that
belong together. Often tools to also assign a type to such groups. E.g. a Noun
Phrase Detection refers to the extraction of chunks around a Noun. This
functionality helps in the detection of multi-word Entities (e.g. the White
House), but it is also interesting for users that want to collect information
about adjectives used in combination with nouns (e.g. nice holiday, beautiful
city, …)
-* __Named Entity Recognition_ (NER)__: The detection of _Entities_ in an
analyzed text. Such entities can consist of multiple words and typically do
have a type assigned. Typical detectable types include _Poerson_,
_Organization_ and _Places_ however most frameworks allow users to train models
for additional domain specific types.
-* __Lemmatization__: Often words in a text are not in the form how they appear
in controlled vocabularies (incl. dictionaries). This might result in
Situations where Entities are not correctly recognized in the text, because the
word of the mention does not match the label in the vocabulary. Lemmatization
help with that as it provides the base form - the _Lemma_ - for the word as
mentioned in the Text.
+* __Language Detection__: As all of the following NLP processing techniques
are highly specific to the language of the text, it is very important to
correctly detect the language of the analyzed text.
+* __Sentence Detection__: Any Stanbol Enhancer chain that uses NLP requires
the detection and extraction of _sentences_ from the analyzed text. Sentences
are typically used as 'processing units' in Stanbol. If no sentence detection
is available for a language, Stanbol will typically process the text as if it
were a single sentence.
+* __Word Tokenization__: The detection of single _words_ is required by the
Stanbol Enhancer to process text. While this is trivial for most languages, it
is a rather complex task for some eastern languages (e.g. Chinese, Japanese,
and Korean). If not otherwise configured, Stanbol will use _whitespace_ to
tokenize words.
+* __Part of Speech (POS) Tagging__: This refers to the annotation of _words_
with their _lexical category_. For entity extraction and linking _words_ with
the category _noun_ and the sub-category _proper noun_ are of special interest.
For POS tagging Stanbol supports both string tags and ontological concepts as
defined by the [OLIA](http://olia.nlp2rdf.org/) ontology.
+* __Chunking__: This refers to the ability to detect groups of words that
belong together. Often tools assign a type to such groups. For example, a noun
phrase detection refers to the extraction of chunks around a noun. This
functionality helps in the detection of multi-word entities (e.g. the White
House), but it is also interesting for users that want to collect information
about adjectives used in combination with nouns (e.g. nice holiday, beautiful
city, ...)
+* __Named Entity Recognition (NER)__: The detection of _entities_ in an
analyzed text. Such entities can consist of multiple words and typically have
an assigned type. Typical detectable types include _persons_,
_organizations_, and _places_. However, most frameworks allow users to train
models for additional domain specific types.
+* __Lemmatization__: Often words in a text are not in the form they would
appear in controlled vocabularies (incl. dictionaries). This might result in
situations where entities are not correctly recognized in the text, because the
found word does not match the label in the vocabulary. Lemmatization helps with
this, as it provides the base form, known as the _lemma_, of a word.
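The default whitespace fallback for word tokenization mentioned above can be sketched in plain Java. This is an illustrative snippet, not the actual Stanbol tokenizer; the `int[]{start, end}` offset pairs merely mimic how span based NLP APIs address positions in the analyzed text:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of whitespace based word tokenization: every maximal run of
// non-whitespace characters becomes one token, addressed by character offsets.
public class WhitespaceTokenizer {

    /** Returns {start, end} character offsets (end exclusive) of each token. */
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean ws = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                        // a token begins here
            } else if (ws && start >= 0) {
                tokens.add(new int[]{start, i});  // the token ends before i
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Stanbol analyzes text";
        for (int[] t : tokenize(text)) {
            System.out.println(text.substring(t[0], t[1]) + " [" + t[0] + "," + t[1] + ")");
        }
    }
}
```

For languages such as Chinese or Japanese, where words are not whitespace separated, a language specific tokenizer has to be configured instead.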
Based on those techniques, Stanbol supports two text enhancement processes
described in the following two subsections.
### Named Entity Linking
-This chain is based on _Named Entity Recognition_ and than linking recognized
entities with controlled vocabularies. A typical _Enhancement Chain_ contains
the following type of Engines:
+This chain is based on _named entity recognition_ (NER) and links the
recognized entities with controlled vocabularies. A typical enhancement chain
contains the following types of engines:
-* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps
-* _Sentence Detection_ (optional): If sentences are detected, than processing
of the later steps is done sentence after sentence what definitely improves
performance and might also improve results.
-* _Word Tokenization_ (required): The detection of Named Entities is based on
processing Tokens.
-* _Named Entity Recognition_ (required): The detection of Entities mentioned
in the Text
-* _Named Entity Linking_ (optional): This steps links Entities recognized in
the Text with Entities defined in a _Controlled Vocabulary_.
+* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps.
+* _Sentence Detection_ (optional): If sentences are detected, the processing
of the later steps is done sentence by sentence instead of the whole text at
once. This improves performance and might also improve results.
+* _Word Tokenization_ (required): The detection of named entities is based on
processed tokens.
+* _Named Entity Recognition_ (required): The recognition of entities mentioned
in the text.
+* _Named Entity Linking_ (optional): Links entities recognized in the text
with entities defined in a controlled vocabulary.
### Entity Linking
-This chain is based on _Part of Speech_, _Chunking_ and _Lematization_
results. It uses those results to lookup _Words_ in the configured _Controlled
Vocabulary_. A typical Enhacement Chain contains the following type of Engines:
+This chain is based on _part of speech_, _chunking_ and _lemmatization_
analysis. It uses those results to look up words in a configured controlled
vocabulary. A typical enhancement chain contains the following types of engines:
-* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps
-* _Sentence Detection_ (optional): If sentences are detected, than processing
of the later steps is done sentence after sentence what definitely improves
performance and might also improve results.
-* _Word Tokenization_ (required): The detection of Named Entities is based on
processing Tokens.
-* _Part of Speech_ (optional): The POS tag of words is used to decide if it
should be linked with the Vocabulary or not. Linked _Lexical Categories_ are
configurable but typically only _Proper Nouns_ or all _Nouns_ are linked.
-* _Noun Phrase Detection_ (optional): If _Chunking_ of _Nouns_ is supported
those information are used to improve linking of multi-word Entities. E.g. two
_Common Nouns_ within the same _Noun Phrase_ are considered as _Proper Noun_.
-* _Lemmatization_ (optional): If configured the Lemma can be used instead of
the _Word_ as mentioned in the text for linking against the controlled
vocabulary.
-* _Entity Linking_ (required): Entity Linking consumes all the above NLP
processing results and uses them to link _Entities_ contained in the configured
_Controlled Vocabulary_ with _Words_ in the text. This process requires (as a
minimum) a correct _Tokenization_ of the Text, but is considerable improved by
_POS_ annotations of _Proper Nouns_ and _Nouns_. _Chunking_ and _Lemmatization_
may further improve results, but their influence on the quality of results is
not as big as of the _POS_ tagging.
+* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps.
+* _Sentence Detection_ (optional): If sentences are detected, the processing
of the later steps is done sentence by sentence instead of the whole text at
once. This improves performance and might also improve results.
+* _Word Tokenization_ (required): Entity linking is based on processed
tokens.
+* _Part of Speech_ (optional): The POS tag of a word is used to decide
whether it should be linked with the vocabulary or not. Linked _lexical
categories_ are configurable, but typically only _proper nouns_ or all _nouns_
are linked.
+* _Noun Phrase Detection_ (optional): If _chunking_ of _nouns_ is supported,
this information is used to improve the linking of multi-word entities. For
example, two _common nouns_ within the same _noun phrase_ are treated as a
_proper noun_.
+* _Lemmatization_ (optional): If configured, the lemma can be used instead of
the word as it appears in the text for linking against the controlled
vocabulary.
+* _Entity Linking_ (required): Entity linking consumes all the above NLP
processing results and uses them to link entities contained in the configured
controlled vocabulary with words in the text. This process requires (as a
minimum) a correct _tokenization_ of the text. It is considerably improved by
POS annotations of proper nouns and nouns. Chunking and lemmatization may
further improve results, but their influence on the quality of results is not
as big as that of POS tagging.
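The required/optional distinction in the two chains above can be sketched as follows. This is an illustrative snippet only (not the Stanbol chain API): a required engine must support the detected language, while an optional engine is simply skipped when it does not:

```java
import java.util.*;

// Sketch of chain execution: engines run in order; optional engines that do
// not support the detected language are skipped, required ones must support it.
public class ChainSketch {

    static class Engine {
        final String name;
        final boolean required;
        final Set<String> languages; // languages this engine supports

        Engine(String name, boolean required, Set<String> languages) {
            this.name = name;
            this.required = required;
            this.languages = languages;
        }
    }

    /** Returns the names of the engines that actually process the text. */
    static List<String> run(List<Engine> chain, String lang) {
        List<String> executed = new ArrayList<>();
        for (Engine e : chain) {
            if (e.languages.contains(lang)) {
                executed.add(e.name);            // engine processes the text
            } else if (e.required) {             // required but unsupported
                throw new IllegalStateException(e.name
                        + " does not support language '" + lang + "'");
            }                                    // optional + unsupported: skip
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Engine> entityLinking = Arrays.asList(
                new Engine("language-detection", true, Set.of("en", "de")),
                new Engine("sentence-detection", false, Set.of("en")),
                new Engine("word-tokenization", true, Set.of("en", "de")),
                new Engine("pos-tagging", false, Set.of("en")),
                new Engine("entity-linking", true, Set.of("en", "de")));
        // For German text the optional sentence detection and POS tagging
        // engines are skipped; the required engines still run.
        System.out.println(run(entityLinking, "de"));
    }
}
```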
-Additional information on how to configure the Apache Stanbol in multilingual
environments are given by the Usage Scenarios [Working with Multiple
Languages](/docs/trunk/multilingual.html).
+Additional information on how to configure Stanbol in multilingual
environments is given in the usage scenario on [working with multiple
languages](/docs/trunk/multilingual.html).
NLP processing API
------------------
-The intension of the Stanbol NLP processing API was to efficiently handle word
level NLP processing annotations. Something that was not possible by using the
RDF [metadata of the ContentItem](../contentitem#metadata-of-the-contentitem).
Instead of RDF the NLP processing API defines a JAVA API that consists of the
following two main parts:
+The intention of the Stanbol NLP processing API is to efficiently handle word
level NLP processing annotations, something that is not possible using the RDF
[metadata of the ContentItem](../contentitem#metadata-of-the-contentitem).
Instead of RDF, the NLP processing API defines a Java API that consists of the
following two main parts:
-* __[AnalysedText](analyzedtext)__: A data structure that represent parts of
the analyzed text such as _Tokens_, _Chunks_, _Sentences_ and the
_AnalysedText_ itself. All such _Spans_ select an part of the text and are
sorted by their natural order in a _NavigateableMap_. The _AnalysedText_
instance is added to the [ContentItem](../contentitem) as ContentPart and is
parsed therefore between [Enhancement Engines](../engines). Every _Span_ of the
_AnalysedText_ can be annotated with _Annotations_.
-* __[NLP Annotations](nlpannotations)__: The Stanbol NLP processing module
defines Ontology aligned annotation models for typical NLP processing results
such as Part of Speech tagging, Phrase detection, Named Entity Recognition,
full Morphological Analysis as well as Sentiment tags. Those annotations can be
used to annotate _Span_ contained in the _AnalysedText_.
+* __[Analysed Text](analyzedtext)__: A data structure that represents parts
of the analyzed text such as _tokens_, _chunks_, _sentences_ and the analysed
text itself. All such _spans_ represent parts of the text and are sorted by
their natural order in a `NavigateableMap`. The `AnalysedText` instance is
added to the [`ContentItem`](../contentitem) as a `ContentPart` and is
therefore passed between [enhancement engines](../engines). Every span of the
`AnalysedText` can be annotated with `Annotations`.
+* __[NLP Annotations](nlpannotations)__: The Stanbol NLP processing module
defines ontology aligned annotation models for typical NLP processing results
such as part of speech tagging, phrase detection, named entity recognition,
full morphological analysis, and sentiment tags. Those annotations can be used
to annotate the `Span`s contained in the `AnalysedText`.
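The span ordering idea described above can be sketched with a plain `java.util.NavigableMap`: spans keyed by their start offset iterate in text order and can be queried by range. This simplifies the real `AnalysedText` model, which also tracks span types, end offsets and annotations:

```java
import java.util.*;

// Sketch: token spans keyed by start offset in a NavigableMap, so they are
// sorted by their natural order and range queries (e.g. "all tokens of a
// sentence") map directly to subMap views.
public class SpanMapSketch {

    /** Returns the span texts starting inside [from, to), in text order. */
    static List<String> spansIn(NavigableMap<Integer, String> spans, int from, int to) {
        return new ArrayList<>(spans.subMap(from, true, to, false).values());
    }

    public static void main(String[] args) {
        String text = "Paris is nice. Rome too.";
        NavigableMap<Integer, String> tokens = new TreeMap<>();
        for (int[] t : new int[][]{{0, 5}, {6, 8}, {9, 13}, {15, 19}, {20, 23}}) {
            tokens.put(t[0], text.substring(t[0], t[1])); // key = start offset
        }
        // all tokens of the first sentence, which spans [0, 14)
        System.out.println(spansIn(tokens, 0, 14)); // [Paris, is, nice]
    }
}
```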
The NLP processing module also provides a default
[in-memory](inmemoryanalyzedtextimpl) implementation of all defined interfaces.
The Stanbol Enhancer uses this implementation by default.
-Finally the NLP processing module also provides:
+Additionally, the NLP processing module provides:
-* Utilities for [implementing NLP processing EnhancementEngines](nlpengine)
and supports the
-* JSON serialization and parsing support for AnalysedText including NLP
Annotations. Together with the [RESTful NLP Analysis
Engine](../engines/restfulnlpanalysis) this can be used to [Integrate NLP
Frameworks as RESTful Services](restfulnlpanalysisservice)
-* RESTful service definition for a [language identification
service](restfullangidentservice) as well as the [RESTful Language
Identification Engine](../engines/restfullangident). This allows to integrate
language identification features of an NLP framework in a similar way as the
NLP Analyses described above (see
[STANBOL-894](https://issues.apache.org/jira/browse/STANBOL-894) for the
Service specification)
+* Utilities for [implementing NLP processing enhancement engines](nlpengine).
+* JSON serialization and parsing support for analysed text including NLP
annotations. Together with the [RESTful NLP analysis
engine](../engines/restfulnlpanalysis) this can be used to [integrate NLP
frameworks as RESTful services](restfulnlpanalysisservice).
+* RESTful service definition for a [language identification
service](restfullangidentservice) as well as the [RESTful language
identification engine](../engines/restfullangident). This makes it possible to
integrate the language identification features of an NLP framework in a
similar way to the NLP analysis described above (see
[STANBOL-894](https://issues.apache.org/jira/browse/STANBOL-894) for the
service specification).
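To illustrate the JSON serialization of analysed text mentioned above, the following purely illustrative snippet builds the general shape of a serialized span with one annotation. The field names used here are made up for the example; the actual wire format is defined by the service specification (STANBOL-894):

```java
// Illustrative only: shows the general shape of a span serialized to JSON
// (type, character offsets, one nested annotation). Not the real wire format.
public class SpanJsonSketch {

    /** Builds a JSON object for a token span with a single POS annotation. */
    static String tokenJson(int start, int end, String posTag) {
        return "{\"type\":\"Token\",\"start\":" + start + ",\"end\":" + end
                + ",\"pos\":{\"tag\":\"" + posTag + "\"}}";
    }

    public static void main(String[] args) {
        // token covering characters [0,7) tagged as a proper noun (NNP)
        System.out.println(tokenJson(0, 7, "NNP"));
        // → {"type":"Token","start":0,"end":7,"pos":{"tag":"NNP"}}
    }
}
```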
Stanbol Enhancer NLP Support
----------------------------