Author: fchrist
Date: Wed Jan 30 14:50:07 2013
New Revision: 1440437
URL: http://svn.apache.org/viewvc?rev=1440437&view=rev
Log:
Minor corrections
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
Modified:
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
URL:
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext?rev=1440437&r1=1440436&r2=1440437&view=diff
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
(original)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
Wed Jan 30 14:50:07 2013
@@ -1,6 +1,6 @@
title: Stanbol Enhancer Natural Language Processing Support
-__NOTE:__ The NLP processing module for the Stanbol Enhancer was introduced by
[STANBOL-733](https://issues.apache.org/jira/browse/STANBOL-733) and is only
available to Stanbol Enhancer starting from version <code>0.10.0</code>
+__NOTE:__ The NLP processing module for the Apache Stanbol Enhancer was
introduced in [STANBOL-733](https://issues.apache.org/jira/browse/STANBOL-733)
and is only available in Apache Stanbol Enhancer versions starting from
<code>0.10.0</code>.
Overview:
---------
@@ -8,72 +8,70 @@ Overview:
This section covers the following topics:
* [Stanbol Natural Language Processing](#stanbol-natural-language-processing):
Short introduction to NLP techniques used by the Stanbol Enhancer
-* The [NLP processing API](#nlp-processing-api): Information about the Java
API of the NLP processing Framework including information on
- * How to implement an [NLP EnhancementEngine](nlpengine) and
- * How to integrate NLP frameworks as a [RESTful NLP Analyses
Service](restfulnlpanalysisservice) and [RESTful Language Identification
Service](restfullangidentservice)
-* Finally a list supported NLP frameworks and languages
- * [Integrated NLP processing Frameworks](#integrated-nlp-frameworks) and
- * [Supported Languages](#supported-languages)
+* The [NLP processing API](#nlp-processing-api): Information about the Java
API of the NLP processing framework including information on
+ * how to implement an [NLP Enhancement Engine](nlpengine) and
+ * how to integrate third-party NLP frameworks as a [RESTful NLP Analyses
Service](restfulnlpanalysisservice) and [RESTful Language Identification
Service](restfullangidentservice)
+* Lists of supported NLP frameworks and languages
+ * [Integrated NLP processing frameworks](#integrated-nlp-frameworks) and
+ * [Supported languages](#supported-languages)
-Additional Information can be found in
-
-* Usage Scenario [Working with Multiple
Languages](/docs/trunk/multilingual.html)
+Additional information can be found in the usage scenario about [working with
multiple languages](/docs/trunk/multilingual.html)
Stanbol Natural Language Processing
-----------------------------------
-The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques
+The natural language processing module of the Stanbol Enhancer supports the
usage of the following NLP processing techniques:
-* __Language Detection__: As all the following NLP processing techniques are
highly specific to the language of the text it is very important to correctly
detect the language of the analyzed text. Any Stanbol Enhancer chain that uses
NLP requires the
-* __Sentence Detection__: The detection / extraction of _Sentences_ from the
analyzed text. Sentences are typically used as 'processing units' by Apache
Stanbol. If no sentence detection is available for a language Stanbol will
typically process the text as if it would be a single Sentence.
-* __Word Tokenization__: The detection of single _Words_ is required by the
Stanbol Enhancer to process text. While this is trivial for most languages it
is a rather complex task for some (e.g. Chinese, Japanese, Korean). If not
otherwise configured Apache Stanbol will use _Whitespaces_ to tokenize words.
-* __Part of Speech (POS) Tagging__: This refers to the annotation of _Words_
with their _Lexical Category_. For Entity extraction / linking especially
_Words_ with the category _Noun_ and the sub-category _Proper Noun_ are of
special interest. For POS tagging Stanbol supports both string tags and
ontological concepts as defined by the [OLIA](http://olia.nlp2rdf.org/)
ontology.
-* __Chunking__: This refers to the ability to detect groups of words that
belong together. Often tools to also assign a type to such groups. E.g. a Noun
Phrase Detection refers to the extraction of chunks around a Noun. This
functionality helps in the detection of multi-word Entities (e.g. the White
House), but it is also interesting for users that want to collect information
about adjectives used in combination with nouns (e.g. nice holiday, beautiful
city, …)
-* __Named Entity Recognition_ (NER)__: The detection of _Entities_ in an
analyzed text. Such entities can consist of multiple words and typically do
have a type assigned. Typical detectable types include _Poerson_,
_Organization_ and _Places_ however most frameworks allow users to train models
for additional domain specific types.
-* __Lemmatization__: Often words in a text are not in the form how they appear
in controlled vocabularies (incl. dictionaries). This might result in
Situations where Entities are not correctly recognized in the text, because the
word of the mention does not match the label in the vocabulary. Lemmatization
help with that as it provides the base form - the _Lemma_ - for the word as
mentioned in the Text.
+* __Language Detection__: As all of the following NLP processing techniques
are highly specific to the language of the text, it is very important to
correctly detect the language of the analyzed text.
+* __Sentence Detection__: Any Stanbol Enhancer chain that uses NLP requires
the detection and extraction of _sentences_ from the analyzed text. Sentences
are typically used as 'processing units' in Stanbol. If no sentence detection
is available for a language, Stanbol will typically process the text as if it
were a single sentence.
+* __Word Tokenization__: The detection of single _words_ is required by the
Stanbol Enhancer to process text. While this is trivial for most languages, it
is a rather complex task for some eastern languages (e.g. Chinese, Japanese,
and Korean). If not otherwise configured, Stanbol will use _whitespace_ to
tokenize words.
+* __Part of Speech (POS) Tagging__: This refers to the annotation of _words_
with their _lexical category_. For entity extraction and linking _words_ with
the category _noun_ and the sub-category _proper noun_ are of special interest.
For POS tagging Stanbol supports both string tags and ontological concepts as
defined by the [OLIA](http://olia.nlp2rdf.org/) ontology.
+* __Chunking__: This refers to the ability to detect groups of words that
belong together. Often tools assign a type to such groups. For example, a noun
phrase detection refers to the extraction of chunks around a noun. This
functionality helps in the detection of multi-word entities (e.g. the White
House), but it is also interesting for users that want to collect information
about adjectives used in combination with nouns (e.g. nice holiday, beautiful
city, ...)
+* __Named Entity Recognition (NER)__: The detection of _entities_ in an
analyzed text. Such entities can consist of multiple words and typically have
an assigned type. Typical detectable types include _persons_,
_organizations_, and _places_. However, most frameworks allow users to train
models for additional domain specific types.
+* __Lemmatization__: Often words in a text are not in the form they would
appear in controlled vocabularies (incl. dictionaries). This might result in
situations where entities are not correctly recognized in the text, because the
found word does not match the label in the vocabulary. Lemmatization helps with
this, as it provides the base form, known as the _lemma_, of a word.
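The default whitespace fallback for word tokenization mentioned above can be sketched in plain Java. This is an illustrative snippet, not the actual Stanbol tokenizer; the `int[]{start, end}` offset pairs merely mimic how span based NLP APIs address positions in the analyzed text:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of whitespace based word tokenization: every maximal run of
// non-whitespace characters becomes one token, addressed by character offsets.
public class WhitespaceTokenizer {

    /** Returns {start, end} character offsets (end exclusive) of each token. */
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean ws = i == text.length() || Character.isWhitespace(text.charAt(i));
            if (!ws && start < 0) {
                start = i;                        // a token begins here
            } else if (ws && start >= 0) {
                tokens.add(new int[]{start, i});  // the token ends before i
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Stanbol analyzes text";
        for (int[] t : tokenize(text)) {
            System.out.println(text.substring(t[0], t[1]) + " [" + t[0] + "," + t[1] + ")");
        }
    }
}
```

For languages such as Chinese or Japanese, where words are not whitespace separated, a language specific tokenizer has to be configured instead.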
Based on those techniques, Stanbol supports two text enhancement processes
described in the following two subsections.
### Named Entity Linking
-This chain is based on _Named Entity Recognition_ and than linking recognized
entities with controlled vocabularies. A typical _Enhancement Chain_ contains
the following type of Engines:
+This chain is based on _named entity recognition_ (NER) and links the
recognized entities with controlled vocabularies. A typical enhancement chain
contains the following types of engines:
-* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps
-* _Sentence Detection_ (optional): If sentences are detected, than processing
of the later steps is done sentence after sentence what definitely improves
performance and might also improve results.
-* _Word Tokenization_ (required): The detection of Named Entities is based on
processing Tokens.
-* _Named Entity Recognition_ (required): The detection of Entities mentioned
in the Text
-* _Named Entity Linking_ (optional): This steps links Entities recognized in
the Text with Entities defined in a _Controlled Vocabulary_.
+* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps.
+* _Sentence Detection_ (optional): If sentences are detected, the processing
of the later steps is done sentence by sentence instead of the whole text at
once. This improves performance and might also improve results.
+* _Word Tokenization_ (required): The detection of named entities is based on
processed tokens.
+* _Named Entity Recognition_ (required): The recognition of entities mentioned
in the text.
+* _Named Entity Linking_ (optional): Links entities recognized in the text
with entities defined in a controlled vocabulary.
### Entity Linking
-This chain is based on _Part of Speech_, _Chunking_ and _Lematization_
results. It uses those results to lookup _Words_ in the configured _Controlled
Vocabulary_. A typical Enhacement Chain contains the following type of Engines:
+This chain is based on _part of speech_, _chunking_ and _lemmatization_
analysis. It uses those results to look up words in a configured controlled
vocabulary. A typical enhancement chain contains the following types of engines:
-* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps
-* _Sentence Detection_ (optional): If sentences are detected, than processing
of the later steps is done sentence after sentence what definitely improves
performance and might also improve results.
-* _Word Tokenization_ (required): The detection of Named Entities is based on
processing Tokens.
-* _Part of Speech_ (optional): The POS tag of words is used to decide if it
should be linked with the Vocabulary or not. Linked _Lexical Categories_ are
configurable but typically only _Proper Nouns_ or all _Nouns_ are linked.
-* _Noun Phrase Detection_ (optional): If _Chunking_ of _Nouns_ is supported
those information are used to improve linking of multi-word Entities. E.g. two
_Common Nouns_ within the same _Noun Phrase_ are considered as _Proper Noun_.
-* _Lemmatization_ (optional): If configured the Lemma can be used instead of
the _Word_ as mentioned in the text for linking against the controlled
vocabulary.
-* _Entity Linking_ (required): Entity Linking consumes all the above NLP
processing results and uses them to link _Entities_ contained in the configured
_Controlled Vocabulary_ with _Words_ in the text. This process requires (as a
minimum) a correct _Tokenization_ of the Text, but is considerable improved by
_POS_ annotations of _Proper Nouns_ and _Nouns_. _Chunking_ and _Lemmatization_
may further improve results, but their influence on the quality of results is
not as big as of the _POS_ tagging.
+* _Language Detection_ (required): The language of the text is needed to
select the correct NLP components for the following processing steps.
+* _Sentence Detection_ (optional): If sentences are detected, the processing
of the later steps is done sentence by sentence instead of the whole text at
once. This improves performance and might also improve results.
+* _Word Tokenization_ (required): Entity linking is based on processed
tokens.
+* _Part of Speech_ (optional): The POS tag of a word is used to decide
whether it should be linked with the vocabulary or not. Linked _lexical
categories_ are configurable, but typically only _proper nouns_ or all _nouns_
are linked.
+* _Noun Phrase Detection_ (optional): If _chunking_ of _nouns_ is supported,
this information is used to improve the linking of multi-word entities. For
example, two _common nouns_ within the same _noun phrase_ are treated as a
_proper noun_.
+* _Lemmatization_ (optional): If configured, the lemma can be used instead of
the word as it appears in the text for linking against the controlled
vocabulary.
+* _Entity Linking_ (required): Entity linking consumes all the above NLP
processing results and uses them to link entities contained in the configured
controlled vocabulary with words in the text. This process requires (as a
minimum) a correct _tokenization_ of the text. It is considerably improved by
POS annotations of proper nouns and nouns. Chunking and lemmatization may
further improve results, but their influence on the quality of results is not
as big as that of POS tagging.
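The required/optional distinction in the two chains above can be sketched as follows. This is an illustrative snippet only (not the Stanbol chain API): a required engine must support the detected language, while an optional engine is simply skipped when it does not:

```java
import java.util.*;

// Sketch of chain execution: engines run in order; optional engines that do
// not support the detected language are skipped, required ones must support it.
public class ChainSketch {

    static class Engine {
        final String name;
        final boolean required;
        final Set<String> languages; // languages this engine supports

        Engine(String name, boolean required, Set<String> languages) {
            this.name = name;
            this.required = required;
            this.languages = languages;
        }
    }

    /** Returns the names of the engines that actually process the text. */
    static List<String> run(List<Engine> chain, String lang) {
        List<String> executed = new ArrayList<>();
        for (Engine e : chain) {
            if (e.languages.contains(lang)) {
                executed.add(e.name);            // engine processes the text
            } else if (e.required) {             // required but unsupported
                throw new IllegalStateException(e.name
                        + " does not support language '" + lang + "'");
            }                                    // optional + unsupported: skip
        }
        return executed;
    }

    public static void main(String[] args) {
        List<Engine> entityLinking = Arrays.asList(
                new Engine("language-detection", true, Set.of("en", "de")),
                new Engine("sentence-detection", false, Set.of("en")),
                new Engine("word-tokenization", true, Set.of("en", "de")),
                new Engine("pos-tagging", false, Set.of("en")),
                new Engine("entity-linking", true, Set.of("en", "de")));
        // For German text the optional sentence detection and POS tagging
        // engines are skipped; the required engines still run.
        System.out.println(run(entityLinking, "de"));
    }
}
```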
-Additional information on how to configure the Apache Stanbol in multilingual
environments are given by the Usage Scenarios [Working with Multiple
Languages](/docs/trunk/multilingual.html).
+Additional information on how to configure Stanbol in multilingual
environments is given in the usage scenario on [working with multiple
languages](/docs/trunk/multilingual.html).
NLP processing API
------------------
-The intension of the Stanbol NLP processing API was to efficiently handle word
level NLP processing annotations. Something that was not possible by using the
RDF [metadata of the ContentItem](../contentitem#metadata-of-the-contentitem).
Instead of RDF the NLP processing API defines a JAVA API that consists of the
following two main parts:
+The intention of the Stanbol NLP processing API is to efficiently handle word
level NLP processing annotations, something that is not possible using the RDF
[metadata of the ContentItem](../contentitem#metadata-of-the-contentitem).
Instead of RDF, the NLP processing API defines a Java API that consists of the
following two main parts:
-* __[AnalysedText](analyzedtext)__: A data structure that represent parts of
the analyzed text such as _Tokens_, _Chunks_, _Sentences_ and the
_AnalysedText_ itself. All such _Spans_ select an part of the text and are
sorted by their natural order in a _NavigateableMap_. The _AnalysedText_
instance is added to the [ContentItem](../contentitem) as ContentPart and is
parsed therefore between [Enhancement Engines](../engines). Every _Span_ of the
_AnalysedText_ can be annotated with _Annotations_.
-* __[NLP Annotations](nlpannotations)__: The Stanbol NLP processing module
defines Ontology aligned annotation models for typical NLP processing results
such as Part of Speech tagging, Phrase detection, Named Entity Recognition,
full Morphological Analysis as well as Sentiment tags. Those annotations can be
used to annotate _Span_ contained in the _AnalysedText_.
+* __[Analysed Text](analyzedtext)__: A data structure that represents parts
of the analyzed text such as _tokens_, _chunks_, _sentences_ and the analysed
text itself. All such _spans_ represent parts of the text and are sorted by
their natural order in a `NavigateableMap`. The `AnalysedText` instance is
added to the [`ContentItem`](../contentitem) as a `ContentPart` and is
therefore passed between [enhancement engines](../engines). Every span of the
`AnalysedText` can be annotated with `Annotations`.
+* __[NLP Annotations](nlpannotations)__: The Stanbol NLP processing module
defines ontology aligned annotation models for typical NLP processing results
such as part of speech tagging, phrase detection, named entity recognition,
full morphological analysis, and sentiment tags. Those annotations can be used
to annotate the `Span`s contained in the `AnalysedText`.
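The span ordering idea described above can be sketched with a plain `java.util.NavigableMap`: spans keyed by their start offset iterate in text order and can be queried by range. This simplifies the real `AnalysedText` model, which also tracks span types, end offsets and annotations:

```java
import java.util.*;

// Sketch: token spans keyed by start offset in a NavigableMap, so they are
// sorted by their natural order and range queries (e.g. "all tokens of a
// sentence") map directly to subMap views.
public class SpanMapSketch {

    /** Returns the span texts starting inside [from, to), in text order. */
    static List<String> spansIn(NavigableMap<Integer, String> spans, int from, int to) {
        return new ArrayList<>(spans.subMap(from, true, to, false).values());
    }

    public static void main(String[] args) {
        String text = "Paris is nice. Rome too.";
        NavigableMap<Integer, String> tokens = new TreeMap<>();
        for (int[] t : new int[][]{{0, 5}, {6, 8}, {9, 13}, {15, 19}, {20, 23}}) {
            tokens.put(t[0], text.substring(t[0], t[1])); // key = start offset
        }
        // all tokens of the first sentence, which spans [0, 14)
        System.out.println(spansIn(tokens, 0, 14)); // [Paris, is, nice]
    }
}
```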
The NLP processing module also provides a default
[in-memory](inmemoryanalyzedtextimpl) implementation of all defined interfaces.
The Stanbol Enhancer uses this implementation by default.
-Finally the NLP processing module also provides:
+Additionally, the NLP processing module provides:
-* Utilities for [implementing NLP processing EnhancementEngines](nlpengine)
and supports the
-* JSON serialization and parsing support for AnalysedText including NLP
Annotations. Together with the [RESTful NLP Analysis
Engine](../engines/restfulnlpanalysis) this can be used to [Integrate NLP
Frameworks as RESTful Services](restfulnlpanalysisservice)
-* RESTful service definition for a [language identification
service](restfullangidentservice) as well as the [RESTful Language
Identification Engine](../engines/restfullangident). This allows to integrate
language identification features of an NLP framework in a similar way as the
NLP Analyses described above (see
[STANBOL-894](https://issues.apache.org/jira/browse/STANBOL-894) for the
Service specification)
+* Utilities for [implementing NLP processing enhancement engines](nlpengine).
+* JSON serialization and parsing support for analysed text including NLP
annotations. Together with the [RESTful NLP analysis
engine](../engines/restfulnlpanalysis) this can be used to [integrate NLP
frameworks as RESTful services](restfulnlpanalysisservice).
+* RESTful service definition for a [language identification
service](restfullangidentservice) as well as the [RESTful language
identification engine](../engines/restfullangident). This makes it possible to
integrate the language identification features of an NLP framework in a
similar way to the NLP analysis described above (see
[STANBOL-894](https://issues.apache.org/jira/browse/STANBOL-894) for the
service specification).
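To illustrate the JSON serialization of analysed text mentioned above, the following purely illustrative snippet builds the general shape of a serialized span with one annotation. The field names used here are made up for the example; the actual wire format is defined by the service specification (STANBOL-894):

```java
// Illustrative only: shows the general shape of a span serialized to JSON
// (type, character offsets, one nested annotation). Not the real wire format.
public class SpanJsonSketch {

    /** Builds a JSON object for a token span with a single POS annotation. */
    static String tokenJson(int start, int end, String posTag) {
        return "{\"type\":\"Token\",\"start\":" + start + ",\"end\":" + end
                + ",\"pos\":{\"tag\":\"" + posTag + "\"}}";
    }

    public static void main(String[] args) {
        // token covering characters [0,7) tagged as a proper noun (NNP)
        System.out.println(tokenJson(0, 7, "NNP"));
        // → {"type":"Token","start":0,"end":7,"pos":{"tag":"NNP"}}
    }
}
```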
Stanbol Enhancer NLP Support
----------------------------