[
https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501777#comment-13501777
]
Rupert Westenthaler commented on STANBOL-734:
---------------------------------------------
Documentation for the NLP Annotations
NLP Annotations
===========
While the [Analyzed Text](analyzedtext) interface allows to define Sentences,
Chunks and Tokens within the text and to attach annotations to them, this part
of the Stanbol NLP processing module defines the Java domain model used for
those annotations. This includes annotation models for Part of Speech (POS)
tags, Chunks, recognized Named Entities (NER) as well as morphological
analysis.
### Part of Speech (POS) annotations
Part of Speech (POS) tagging is a token level annotation. It assigns categories
like noun, verb, adjective or punctuation to tokens. These annotations are
typically provided by a POS tagger that consumes Tokens and provides tag(s)
with confidence(s) as output. Tags are usually string values that are members
of a TagSet - a fixed list of tags used to annotate tokens. Tag sets are
typically language specific and often even specific to the training corpus.
This makes it hard to consume POS tags created by different POS taggers for
different languages, as the consumer would need to know the meaning of every
POS tag for every language.
The POS annotation model defined by the Stanbol NLP module tries to solve this
issue by providing means to align POS tag sets with formal categories defined
by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The following
sub-sections provide details and usage examples.
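As a rough illustration of why shared categories help, the sketch below maps
tags from two tag sets onto one common category enum, so that a consumer can
select nouns without knowing either tag set. This is not Stanbol code:
`Category`, `PENN`, `STTS_DE` and `countNouns` are hypothetical names, and only
a handful of Penn Treebank and STTS tags are included.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical shared categories (a tiny subset of what OLiA defines).
enum Category { NOUN, VERB }

class TagAlignmentSketch {
    // A few Penn Treebank tags (English) mapped to the shared categories.
    static final Map<String, Category> PENN = new HashMap<>();
    // A few STTS tags (German) mapped to the same shared categories.
    static final Map<String, Category> STTS_DE = new HashMap<>();
    static {
        PENN.put("NN", Category.NOUN);
        PENN.put("NNP", Category.NOUN);
        PENN.put("VB", Category.VERB);
        STTS_DE.put("NN", Category.NOUN);
        STTS_DE.put("NE", Category.NOUN);
        STTS_DE.put("VVFIN", Category.VERB);
    }

    // Counts nouns using only the shared category; unmapped tags are skipped.
    static int countNouns(List<String> tags, Map<String, Category> tagSet) {
        int nouns = 0;
        for (String tag : tags) {
            if (tagSet.get(tag) == Category.NOUN) {
                nouns++;
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        // The same consumer code works for both languages.
        System.out.println(countNouns(Arrays.asList("NNP", "VB", "NN"), PENN));
        System.out.println(countNouns(Arrays.asList("NE", "VVFIN", "XY"), STTS_DE));
    }
}
```

The consumer only ever compares against `Category.NOUN`; supporting another
language means adding one more mapping, not touching the consumer.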
#### OLiA MorphosyntacticCategories
The '[OLiA](http://nlp2rdf.lod2.eu/olia/) Reference Model for Morphology and
Morphosyntax, with experimental extension to Syntax' defines a set of ~150
formally defined, multi-lingual POS tags. Those types are defined as a
non-cyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as common root.
For example the POS 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which
itself is an 'olia:Verb'. An example for the multi-hierarchy is
'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.
To integrate the formal definitions of the OLiA ontology with the Stanbol NLP
annotations there are two Java enumerations:
* __LexicalCategory__: This enumeration covers the 12 top level categories as
defined by OLiA: Noun, Verb, Adjective, Adposition, Adverb, Conjuction,
Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and
Unique.
* __Pos__: This enumeration covers all OLiA MorphosyntacticCategories from the
2nd level downwards. By using the _Pos_ enum one can e.g. distinguish between
ProperNoun and CommonNoun or between FiniteVerb and NonFiniteVerb. The _Pos_
enumeration has full support for the multi-hierarchy as defined by OLiA. The
Pos#categories() method returns the 1st level parents of a _Pos_ member. The
Pos#hierarchy() method returns all its parents from the 2nd level downwards.
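The two-enum design with a multi-hierarchy can be sketched in a few lines. The
following is a hypothetical miniature, not the Stanbol source: `MiniCategory`
and `MiniPos` are made-up names that only borrow the OLiA concepts mentioned
above, and `categories()` is assumed to return all top level ancestors.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.EnumSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the top level categories.
enum MiniCategory { Noun, Verb, Quantifier }

// Hypothetical stand-in for the 2nd+ level categories.
enum MiniPos {
    NonFiniteVerb(MiniCategory.Verb),
    Gerund(NonFiniteVerb),
    // multi-hierarchy: two top level parents
    NominalQuantifier(MiniCategory.Noun, MiniCategory.Quantifier);

    private final Set<MiniCategory> categories = EnumSet.noneOf(MiniCategory.class);
    private final Set<MiniPos> hierarchy = new LinkedHashSet<>();

    // Member whose direct parents are top level categories.
    MiniPos(MiniCategory... topLevelParents) {
        categories.addAll(Arrays.asList(topLevelParents));
    }

    // Member whose direct parents are other MiniPos members; the parents'
    // own hierarchy and categories are inherited transitively.
    MiniPos(MiniPos... parents) {
        for (MiniPos parent : parents) {
            hierarchy.add(parent);
            hierarchy.addAll(parent.hierarchy);
            categories.addAll(parent.categories);
        }
    }

    Set<MiniCategory> categories() {
        return Collections.unmodifiableSet(categories);
    }

    Set<MiniPos> hierarchy() {
        return Collections.unmodifiableSet(hierarchy);
    }
}

class PosHierarchyDemo {
    public static void main(String[] args) {
        System.out.println(MiniPos.Gerund.categories());
        System.out.println(MiniPos.Gerund.hierarchy());
        System.out.println(MiniPos.NominalQuantifier.categories());
    }
}
```

Resolving the parents once in the constructor keeps lookups cheap, which
matters when the same question is asked for every token of a text.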
#### PosTag and TagSet
The PosTag represents a POS tag as used by a POS tagger. PosTags support the
following features:
* __tag__ [1..1]::String - This is the string tag as used by the POS tagger.
* __category__ [0..*]::LexicalCategory - The assigned LexicalCategory
enumeration members.
* __pos__ [0..*]::Pos - The assigned Pos enumeration members.
An example for a PosTag representing an 'olia:ProperNoun' looks as follows:
:::java
PosTag tag = new PosTag("NP", Pos.ProperNoun);
The first parameter is the string POS tag used by the POS tagger and the second
parameter represents the mapping to the OLiA MorphosyntacticCategories for this
tag. The next example shows a more sophisticated mapping for "PWAV" (adverbial
interrogative or relative pronoun) as used by the STTS tag set for the German
language:
:::java
new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun,
Pos.InterrogativePronoun);
_TagSet_ is the other important class as it manages a set of PosTag instances.
_TagSet_ has two main functions: First, it allows the integrator of a POS
tagger to define the mappings from the string POS tags used by the POS tagger
to the LexicalCategory and Pos enumeration members that are preferably used by
the Stanbol NLP chain. Second, it ensures that only a single PosTag instance is
used to annotate all Tokens with the same tag.
_TagSet_s are typically specified as static members of utility classes. The
following code snippet shows an example
:::java
//TagSet is generically typed. We need a TagSet for PosTag's
public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
    "STTS", "de"); //define a name and the languages it supports
static {
    //you can set properties to a TagSet. While supported this
    //feature is currently not used by Stanbol
    STTS.getProperties().put("olia.annotationModel",
        new UriRef("http://purl.org/olia/stts.owl"));
    STTS.getProperties().put("olia.linkingModel",
        new UriRef("http://purl.org/olia/stts-link.rdf"));
    STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
    STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
    STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
    //[...]
}
The string tag (first parameter) of the _PosTag_ is used as unique key by the
_TagSet_. Adding a second _PosTag_ with the same tag will override the first
one. _PosTag_s that are added to a _TagSet_ have their
_Tag#getAnnotationModel()_ property set to that _TagSet_.
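The keyed-registry behaviour described here (string tag as unique key,
override on re-adding, back-reference to the owning tag set) can be sketched as
follows. `SketchTag` and `SketchTagSet` are hypothetical stand-ins written for
this illustration; they are not the Stanbol classes and omit generics and
language handling.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a tag; only the string tag is modelled.
class SketchTag {
    final String tag;
    private SketchTagSet annotationModel; // set when added to a tag set

    SketchTag(String tag) {
        this.tag = tag;
    }

    SketchTagSet getAnnotationModel() {
        return annotationModel;
    }

    void setAnnotationModel(SketchTagSet annotationModel) {
        this.annotationModel = annotationModel;
    }
}

// Hypothetical stand-in for a tag set: a map keyed by the string tag.
class SketchTagSet {
    final String name;
    private final Map<String, SketchTag> tags = new HashMap<>();

    SketchTagSet(String name) {
        this.name = name;
    }

    // Registers the tag; a tag with the same string replaces the old one.
    void addTag(SketchTag tag) {
        tag.setAnnotationModel(this);
        tags.put(tag.tag, tag);
    }

    // Returns the single shared instance for the string tag, or null.
    SketchTag getTag(String tag) {
        return tags.get(tag);
    }
}

class TagSetSketchDemo {
    public static void main(String[] args) {
        SketchTagSet set = new SketchTagSet("demo");
        set.addTag(new SketchTag("ADV"));
        SketchTag replacement = new SketchTag("ADV");
        set.addTag(replacement);
        System.out.println(set.getTag("ADV") == replacement);
    }
}
```

Because lookups always return the registered instance, every token with the
same string tag ends up sharing one tag object.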
The final example shows the core part of a POS tagging engine that uses both
the [AnalyzedText](analyzedtext) and the _PosTag_ and _TagSet_ APIs.
:::java
TagSet<PosTag> tagSet; //the used TagSet
//holds PosTags for tags returned by the POS tagger that
//are missing in the TagSet
Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
List<Span> tokens = new ArrayList<Span>(64);
Iterator<Section> sentences; //Iterator over the sentences
while(sentences.hasNext()){
    Section sentence = sentences.next();
    //get the tokens of the current sentence
    tokens.clear();
    AnalysedTextUtils.appandToList(
        sentence.getEnclosed(SpanTypeEnum.Token),
        tokens);
    //typically one needs also to get the Strings
    //of the tokens for the pos tagger
    String[] tokenText = new String[tokens.size()];
    for(int i=0;i<tokens.size();i++){
        tokenText[i] = tokens.get(i).getSpan();
    }
    //now POS tag the sentence
    String[] posTags = posTagger.tag(tokenText);
    //finally apply the PosTags and save the annotation
    for(int i=0;i<tokens.size();i++){
        PosTag tag = tagSet.get(posTags[i]);
        if(tag == null) { //unmapped tag
            tag = adhocTags.get(posTags[i]);
        }
        if(tag == null) { //unknown tag
            tag = new PosTag(posTags[i]);
            adhocTags.put(posTags[i],tag);
        }
        //add the annotation to the current Token
        tokens.get(i).addAnnotation(
            NlpAnnotations.POS_ANNOTATION,
            Value.value(tag));
    }
}
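The snippet above wraps the shared tag instance in a `Value` before attaching
it to a token. The idea behind such a wrapper, keeping the per-token
probability outside the tag so the tag object itself can be reused, can be
sketched like this. `SketchValue` is a hypothetical class written for this
illustration, and the `-1.0` sentinel for an unknown probability is an
assumption, not taken from the text above.

```java
// Hypothetical wrapper pairing a shared value with a per-use probability.
class SketchValue<T> {
    // Assumed sentinel for "no probability available".
    static final double UNKNOWN_PROBABILITY = -1.0;

    private final T value;
    private final double probability;

    private SketchValue(T value, double probability) {
        if (value == null) {
            throw new IllegalArgumentException("value must not be null");
        }
        if (probability != UNKNOWN_PROBABILITY
                && (probability < 0.0 || probability > 1.0)) {
            throw new IllegalArgumentException("probability must be in [0..1]");
        }
        this.value = value;
        this.probability = probability;
    }

    // Factory for values without a known probability.
    static <T> SketchValue<T> value(T value) {
        return new SketchValue<>(value, UNKNOWN_PROBABILITY);
    }

    // Factory for values with a probability in the range [0..1].
    static <T> SketchValue<T> value(T value, double probability) {
        return new SketchValue<>(value, probability);
    }

    T value() {
        return value;
    }

    double probability() {
        return probability;
    }
}

class SketchValueDemo {
    public static void main(String[] args) {
        String sharedTag = "NN"; // stands in for a shared tag instance
        SketchValue<String> first = SketchValue.value(sharedTag, 0.93);
        SketchValue<String> second = SketchValue.value(sharedTag, 0.71);
        // both annotations reference the very same tag object
        System.out.println(first.value() == second.value());
    }
}
```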
### Phrase annotations
Phrase annotations can be used to define the type of a _Chunk_. The _PhraseTag_
class is used for phrase annotations. It defines a string tag and the phrase
category. The _LexicalCategory_ enumeration is used as value for the category.
As _PhraseTag_ is a subclass of _Tag_ it can also be used in combination with
the _TagSet_ class as described in the [PosTag and TagSet] section.
The following code snippet shows how to create a PhraseTag for noun phrases:
:::java
PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);
### Named Entity (NER) annotations
Named Entity annotations are created by NER modules. Before the Stanbol NLP
chain they were represented in Stanbol by using
'[fise:TextAnnotation](../enhancementstructure#fisetextannotation)'s and any
Enhancement Engine that does NER should still support this. With the Stanbol
NLP processing module it is now also possible to represent detected Named
Entities as _Chunk_s with a _NerTag_ added as annotation.
A Named Entity represented as 'fise:TextAnnotation' includes the following
information:
urn:namedEntity:1
    rdf:type fise:TextAnnotation, fise:Enhancement
    fise:selected-text {named-entity-text}
    fise:start {start-char-pos}
    fise:end {end-char-pos}
    dc:type {named-entity-type}
where:
* {named-entity-text} is the text recognized as Named Entity. This is the same
as returned by _Chunk#getSpan()_
* {start-char-pos} is the start character position of the Named Entity relative
to the start of the text. This is the same as _Chunk#getStart()_
* {end-char-pos} is the end character position and the same as _Chunk#getEnd()_
* {named-entity-type} is the type of the recognized Named Entity as URI. The
_NerTag_ allows to define both the string tag as used by the NER component and
the URI this type is mapped to. In Stanbol it is preferred to use
'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the
corresponding entity types.
The _NerTag_ class extends _Tag_ and can therefore be also used with the
_TagSet_ class. This means that users of the API can use _TagSet_ to manage the
string tag to URI mappings for the supported Named Entity types.
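A string tag to URI mapping of this kind can be sketched with a plain map. The
class below is a hypothetical illustration and not the _NerTag_/_TagSet_ API;
the string tags "person", "organization" and "location" are assumed example
outputs of a NER component, mapped to the DBpedia ontology types named above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mapping of NER string tags to type URIs.
class NerTypeMappingSketch {
    static final String DBPEDIA_ONTOLOGY = "http://dbpedia.org/ontology/";
    private static final Map<String, String> TYPE_URIS = new HashMap<>();
    static {
        // assumed string tags of a NER component
        TYPE_URIS.put("person", DBPEDIA_ONTOLOGY + "Person");
        TYPE_URIS.put("organization", DBPEDIA_ONTOLOGY + "Organisation");
        TYPE_URIS.put("location", DBPEDIA_ONTOLOGY + "Place");
    }

    // Returns the mapped type URI, or null for an unmapped NER tag.
    static String typeUri(String nerTag) {
        return TYPE_URIS.get(nerTag);
    }

    public static void main(String[] args) {
        System.out.println(typeUri("person"));
        // unmapped tags yield null, so no type is written for them
        System.out.println(typeUri("misc"));
    }
}
```

Returning `null` for unmapped tags lets the caller decide to keep the string
tag but skip the type URI.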
The following code snippet shows how to add NER annotations to the
AnalysedText:
:::java
AnalysedText at; //The AnalysedText
TagSet<NerTag> nerTags; //registered NER tags
Iterator<Section> sections; //sections to iterate over
List<String> tokenTexts = new ArrayList<String>(64);
while(sections.hasNext()){
    Section section = sections.next();
    //NER tagger typically need String[] as input
    tokenTexts.clear();
    Iterator<Token> tokens = section.getTokens();
    while(tokens.hasNext()){
        tokenTexts.add(tokens.next().getSpan());
    }
    //Span (result type of the NER tagger) -> #start #end #type #probability
    Span[] nerSpans = nerTagger.tag(
        tokenTexts.toArray(new String[tokenTexts.size()]));
    for(int i=0; i < nerSpans.length; i++){
        Chunk namedEntity = at.addChunk(
            nerSpans[i].start, nerSpans[i].end);
        NerTag tag = nerTags.get(nerSpans[i].type);
        if(tag == null){ //unmapped NER
            tag = new NerTag(nerSpans[i].type);
        }
        namedEntity.addAnnotation(
            NlpAnnotations.NER_ANNOTATION,
            Value.value(tag, nerSpans[i].probability));
    }
}
Note that the above code snippet only shows how to add the Named Entity to the
AnalyzedText ContentPart. An actual NER engine implementation also needs to add
this information to the metadata of the [ContentItem](../contentitem).
:::java
ContentItem ci; //The processed ContentItem
Language language; //The Language of the processed Text
MGraph metadata = ci.getMetadata();
LiteralFactory literalFactory = LiteralFactory.getInstance();
Section section; //the current Section
Chunk namedEntity; //the currently processed Named Entity
Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
    NlpAnnotations.NER_ANNOTATION);
UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci,
    this);
metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
    new PlainLiteralImpl(namedEntity.getSpan(), language)));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
    new PlainLiteralImpl(section.getSpan(), language)));
if(nerAnnotation.value().getType() != null){
    metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
        nerAnnotation.value().getType()));
} //else do not add a dc:type for unmapped Named Entities
metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
    literalFactory.createTypedLiteral(nerAnnotation.probability())));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
    literalFactory.createTypedLiteral(namedEntity.getStart())));
metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
    literalFactory.createTypedLiteral(namedEntity.getEnd())));
### Morphological Analyses
__NOTE:__ _This part of the Stanbol NLP annotations is still work in progress.
So this part of the API might undergo heavy changes even in minor releases._
The results of a morphological analysis are represented by the _MorphoFeatures_
class and can be added to the analyzed word (_Token_) by using the
_NlpAnnotations.MORPHO_ANNOTATION_. The _MorphoFeatures_ class provides the
following features:
* __Lemma__: A String value representing the lemmatization of the annotated
Token.
* __Case__: The _Case_ enumeration contains around 70 members defined based on
concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _CaseTag_
allows to define cases and optionally map them to the cases defined by the
enumeration.
* __Definitness__: The _Definitness_ enumeration has the members Definite and
Indefinite also defined by Concepts in the [OLiA
Ontology](http://nlp2rdf.lod2.eu/olia/).
* __Gender__: The _Gender_ enumeration contains the six genders defined by the
[OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _GenderTag_ allows to define
genders and optionally map them to the genders defined by the enumeration.
* __Number__: The _NumberFeature_ enumeration defines the eight number features
defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The _NumberTag_ can be used to
define number features and map them to the members of the enumeration.
* __Person__: The _Person_ enumeration has the definitions for 'first',
'second' and 'third' with mappings to the corresponding concepts of the [OLiA
Ontology](http://nlp2rdf.lod2.eu/olia/).
* __Tense__: The _Tense_ enumeration represents the tense hierarchy as defined
by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _Tense#getParent()_
method allows access to the direct parent of a _Tense_ while the
_Tense#getTenses()_ method can be used to obtain the transitive closure
(including the _Tense_ object itself). _TenseTag_ is used for Tense
annotations. It allows both to specify a string tag representing the tense and
to define a mapping to the tenses defined by the _Tense_ enumeration.
* __Mood__: The _VerbMood_ enumeration currently defines members from different
parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). OLiA does define
the 'olia:MoodFeature' class, but its members were not a good match for the
verb moods used by the CELI/linguagrid.org service. For now the decision was to
define the _VerbMood_ enumeration closer to the usage of CELI, but this clearly
needs to be validated as soon as implementations for other NLP frameworks are
added. There is also a _VerbMoodTag_ that allows to define verb moods by a
string tag and a mapping to the _VerbMood_ enumeration.

The _MorphoFeatures_ class supports multi-valued annotations for all the above
features. Getters for a single value always return the first added value.
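The parent/transitive-closure behaviour described for _Tense_ can be sketched
with a small enum. `SketchTense` is hypothetical, and the hierarchy it shows
(`RemotePast` below `Past`) is only illustrative; it is not claimed to match
the tense hierarchy defined by OLiA or the actual _Tense_ enumeration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical tense enum with a single-parent hierarchy.
enum SketchTense {
    Past(null),
    RemotePast(Past), // illustrative child of Past
    Future(null);

    private final SketchTense parent;

    SketchTense(SketchTense parent) {
        this.parent = parent;
    }

    // Direct parent, or null for a root of the hierarchy.
    SketchTense getParent() {
        return parent;
    }

    // Transitive closure over the parents, including this member itself.
    Set<SketchTense> getTenses() {
        Set<SketchTense> closure = new LinkedHashSet<>();
        for (SketchTense t = this; t != null; t = t.parent) {
            closure.add(t);
        }
        return closure;
    }
}

class SketchTenseDemo {
    public static void main(String[] args) {
        // a consumer interested in any past tense checks the closure
        System.out.println(SketchTense.RemotePast.getTenses().contains(SketchTense.Past));
        System.out.println(SketchTense.Future.getTenses().contains(SketchTense.Past));
    }
}
```

Checking the closure instead of the exact member lets a consumer ask for a
general tense without enumerating all of its specializations.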
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
> Key: STANBOL-734
> URL: https://issues.apache.org/jira/browse/STANBOL-734
> Project: Stanbol
> Issue Type: Sub-task
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - that is usually available on a word
> granularity - is not feasible using the RDF metadata this describes the
> addition of a special ContentPart to Stanbol. This ContentPart will have the
> name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Span
> * Spans are sorted naturally by type, start and end. This allows to
> use a NavigableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigableSet even allow to build Iterators that allow concurrent
> modifications (e.g. adding Chunks while iterating over the Tokens of a
> Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that support a type
> safe view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) to use the same instance ( e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF for representing this
> kind of data as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is a sample code
> AnalysedText at; //the contentPart
> Iterator<Sentence> sentences = at.getSentences();
> while(sentences.hasNext()){
>     Sentence sentence = sentences.next();
>     String sentText = sentence.getSpan();
>     Iterator<Token> tokens = sentence.getTokens();
>     while(tokens.hasNext()){
>         Token token = tokens.next();
>         String tokenText = token.getSpan();
>         Value<PosTag> pos = token.getAnnotation(
>             NlpAnnotations.posAnnotation);
>         String tag = pos.value().getTag();
>         double confidence = pos.probability();
>     }
> }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both define also an optional LexicalCategory. This is an enum
> with the 12 top level concepts defined by the
> [OLiA](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This includes
> also the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLiA. The other TagSets used by OpenNLP are currently not available from
> OLiA.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
> TagSet<PosTag> tagSet; //the used TagSet
> Map<String,PosTag> unknown; //missing tags in the TagSet
> Token token; //the token
> String tag; //the detected tag
> double prob; //the probability
> PosTag pos = tagSet.getTag(tag);
> if(pos == null){ //unknown tag
>     pos = unknown.get(tag);
> }
> if(pos == null) {
>     pos = new PosTag(tag);
>     //this tag will not have a LexicalCategory
>     unknown.put(tag, pos); //only one instance
> }
> token.addAnnotation(
>     NlpAnnotations.POSAnnotation,
>     new Value<PosTag>(pos, prob));