analyzedtext.mdtext index.mdtext inmemoryanalyzedtextimpl.mdtext nlpannotations

rwesten Fri, 23 Nov 2012 05:11:58 -0800

Author: rwesten
Date: Fri Nov 23 13:11:23 2012
New Revision: 1412870

URL: http://svn.apache.org/viewvc?rev=1412870&view=rev
Log:
STANBOL-733 - documentation - corrected image name


Added:
    stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.mdtext
    stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
    
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/inmemoryanalyzedtextimpl.mdtext
    stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.mdtext?rev=1412870&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/analyzedtext.mdtext
 Fri Nov 23 13:11:23 2012
@@ -0,0 +1,163 @@
+title: AnalysedText
+
+The AnalysedText is a Java domain model designed to describe NLP processing results. It consists of two major parts:
+
+1. Structure of the Text such as text-sections, sentences, chunks and tokens
+2. Annotations for the detected parts of the text.
+
+
+## AnalysedText as ContentPart
+
+Within the Stanbol Enhancer the AnalysedText is used as a [ContentPart](../contentitem#content-parts) registered with the URI <code>urn:stanbol.enhancer:nlp.analysedText</code>.
+
+Because of that it can be retrieved by using the following code:
+
+    :::java
+    AnalysedText at;
+    ci.getLock().readLock().lock();
+    try {
+        at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
+    } catch (NoSuchPartException e) {
+        //not present
+        at = null;
+    } finally {
+        ci.getLock().readLock().unlock();
+    }
+
+Components that need to create an AnalysedText instance can do so by using the _AnalysedTextFactory_:
+
+    :::java
+    @Reference
+    AnalysedTextFactory atf;
+
+    ContentItem ci; //the contentItem
+    AnalysedText at;
+    Entry<UriRef,Blob> plainTextBlob = ContentItemHelper.getBlob(
+        ci, Collections.singleton("text/plain"));
+    if(plainTextBlob != null){
+        //creates and adds the AnalysedText ContentPart to the ContentItem
+        ci.getLock().writeLock().lock();
+        try {
+            at = atf.createAnalysedText(ci,plainTextBlob.getValue());
+        } finally {
+            ci.getLock().writeLock().unlock();
+        }
+    } else { //no NLP processing possible
+        at = null;
+    }
+
+Outside of OSGi, users can call AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory instance of the in-memory implementation.
+
+
+## Structure of the Text
+
+The basic building block of the AnalysedText is the Span. A Span defines a type, [start,end) as well as the spanText. For the type an enumeration (_SpanTypeEnum_) with the members Text, TextSection, Sentence, Chunk and Token is used. [start,end) defines the character positions of the Span within the Text, where the start position is inclusive and the end position is exclusive.
+
+Analogous to the type of the Span there are also Java interfaces representing those types and providing additional convenience methods. An additional _Section_ interface was introduced as common parent for all types that may have enclosed Spans. The AnalysedText is the interface representing SpanTypeEnum#Text. The main intention of those Java interfaces is to provide convenience methods that ease the use of the API.
+
+### Uniqueness of Spans
+
+A Span is considered equal to another Span if [start, end) and type are the same. The natural order of Spans is defined by
+
+* smaller start index first
+* bigger end index first
+* higher ordinal number of the SpanTypeEnum first
+
+This order is used by all Iterators returned by the AnalysedText API.
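+
+The three ordering rules can be sketched as a plain comparison function. This is a minimal sketch, not the Stanbol implementation: _SpanKey_ and _SpanType_ are hypothetical stand-ins, and the member order of the enum is an assumption taken from the list of types given above.
+

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the Stanbol classes.
class SpanOrderSketch {

    // Assumed member order, following the list of types in the text.
    enum SpanType { Text, TextSection, Sentence, Chunk, Token }

    static final class SpanKey {
        final SpanType type;
        final int start; // inclusive
        final int end;   // exclusive

        SpanKey(SpanType type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // Applies the three rules literally, in the order they are listed.
    static int compare(SpanKey a, SpanKey b) {
        if (a.start != b.start) {
            return Integer.compare(a.start, b.start); // smaller start first
        }
        if (a.end != b.end) {
            return Integer.compare(b.end, a.end);     // bigger end first
        }
        // higher ordinal of the type enumeration first
        return Integer.compare(b.type.ordinal(), a.type.ordinal());
    }

    static List<SpanKey> sorted(List<SpanKey> spans) {
        List<SpanKey> copy = new ArrayList<SpanKey>(spans);
        copy.sort(SpanOrderSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<SpanKey> spans = new ArrayList<SpanKey>();
        spans.add(new SpanKey(SpanType.Token, 6, 10));
        spans.add(new SpanKey(SpanType.Chunk, 0, 5));
        spans.add(new SpanKey(SpanType.Token, 0, 5));
        spans.add(new SpanKey(SpanType.Sentence, 0, 20));
        spans.add(new SpanKey(SpanType.Text, 0, 40));
        System.out.println(sorted(spans));
    }
}
```

+
+With these rules an enclosing span such as the whole text or a sentence is always visited before the spans it contains, because it shares the start index but has the bigger end index.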
+
+### Concurrent Modifications and Iterators
+
+Iterators returned by the AnalysedText API MUST NOT throw _ConcurrentModificationException_s but rather reflect changes to the underlying model. While this is not consistent with the default behavior of Iterators in Java, it is central for the effective usage of the AnalysedText API - e.g. when iterating over Sentences while adding Tokens.
+
+### Code Samples:
+
+The following Code Snippet shows some typical usages of the API:
+
+    :::java
+    AnalysedText at; //typically retrieved from the contentPart
+    Iterator<Sentence> sentences = at.getSentences();
+    while(sentences.hasNext()){
+        Sentence sentence = sentences.next();
+        String sentText = sentence.getSpan();
+        Iterator<Token> tokens = sentence.getTokens();
+        while(tokens.hasNext()){
+            Token token = tokens.next();
+            String tokenText = token.getSpan();
+            Value<PosTag> pos = token.getAnnotation(
+                NlpAnnotations.POS_ANNOTATION);
+            if(pos != null){ //null if the token has no POS annotation
+                String tag = pos.value().getTag();
+                double confidence = pos.probability();
+            }
+        }
+    }
+
+Code that adds new Spans looks as follows:
+
+    :::java
+    //Tokenize a Text
+    Iterator<Sentence> sentences = at.getSentences();
+    Iterator<? extends Section> sections;
+    if(sentences.hasNext()){ //sentence annotations present
+        sections = sentences;
+    } else { //if no sentences tokenize the text at once
+        sections = Collections.singleton(at).iterator();
+    }
+    //Tokenize the sections
+    while(sections.hasNext()){
+        Section section = sections.next();
+        //assuming the Tokenizer returns tokens as 2dim int array
+        int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
+        for(int ti = 0; ti < tokenSpans.length; ti++){
+            Token token = section.addToken(
+                tokenSpans[ti][0],tokenSpans[ti][1]);
+        }
+    }
+
+For all #add**(start,end) methods in the API the passed start and end indexes are relative to the parent (the Span the #add**(..) method is called on). The [start,end) indexes returned by Spans are absolute values. If an #add**(..) method is called for a Span '[start,end):type' that already exists, then the existing instance is returned instead of a new one.
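+
+The relation between relative and absolute indexes, and the reuse of already existing Spans, can be illustrated with a small model. This is only a sketch under stated assumptions: the _Span_ class, its string registry key and the range check below are hypothetical and much simpler than the real API.
+

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini model for illustration only; not the Stanbol implementation.
class RelativeOffsetSketch {

    static final class Span {
        final String type;
        final int start; // absolute, inclusive
        final int end;   // absolute, exclusive
        private final Map<String, Span> registry; // shared by all spans of one text

        Span(String type, int start, int end, Map<String, Span> registry) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.registry = registry;
        }

        // relStart and relEnd are relative to the start of this (parent) span
        Span addToken(int relStart, int relEnd) {
            int absStart = start + relStart;
            int absEnd = start + relEnd;
            if (relStart < 0 || relEnd < relStart || absEnd > end) {
                throw new IllegalArgumentException("token outside of parent span");
            }
            String key = "Token:" + absStart + ":" + absEnd;
            Span existing = registry.get(key);
            if (existing != null) {
                return existing; // same [start,end):type -> same instance
            }
            Span token = new Span("Token", absStart, absEnd, registry);
            registry.put(key, token);
            return token;
        }
    }

    public static void main(String[] args) {
        // a sentence covering the absolute range [10,30) of the text
        Span sentence = new Span("Sentence", 10, 30, new HashMap<String, Span>());
        Span token = sentence.addToken(0, 5);
        System.out.println(token.start + ".." + token.end);       // prints 10..15
        System.out.println(token == sentence.addToken(0, 5));     // prints true
    }
}
```

+
+A tokenizer that works on the text of a single sentence can therefore pass its own offsets unchanged, and calling it twice for the same sentence does not create duplicate Tokens.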
+
+
+## Annotation Support
+
+Annotation support is provided by two interfaces _Annotated_ and _Annotation_ and the _Value_ class. _Annotated_ provides an API for adding information to the annotated object. Those annotations are represented by key value mappings where Object is used as key and the _Value_ class for values. The _Value_ class provides the generically typed value as well as a double probability in the range [0..1] or -1 if not known. Finally the _Annotation_ class is used to ensure type safety.
+
+The following example shows the intended usage of the API:
+
+1. One needs to define the _Annotations_ one would like to use. Annotations 
are typically defined as public static members of interfaces or classes. The 
following example uses the definition of the Part of Speech annotation.
+
+    :::java
+    public interface NlpAnnotations {
+        //a Part of Speech Annotation using a String key
+        //and the PosTag class as value
+        Annotation<String,PosTag> POS_ANNOTATION = new Annotation<String,PosTag>(
+            "stanbol.enhancer.nlp.pos", PosTag.class);
+        //[...]
+    }
+
+2. Defined _Annotation_s are used to add information to an _Annotated_ instance (like a Span). For adding annotations the use of _Annotation_s is required to ensure type safety. The following code snippet shows how to add a PosTag with the probability 0.95.
+
+    :::java
+    PosTag tag = new PosTag("N"); //a simple POS tag
+    Token token; //The Token we want to add the tag to
+    token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));
+
+3. For consuming annotations there are two options: first by using the _Annotation_ object and second by directly using the key. While the second option is less convenient (as it does not provide type safety) it allows consuming annotations without the need to have the used _Annotation_ in the classpath. The following example shows both options.
+
+    :::java
+    Iterator<Token> tokens = sentence.getTokens();
+    while(tokens.hasNext()){
+        Token token = tokens.next();
+        //(1) use the POS_ANNOTATION to get the PosTag
+        Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
+        if(tag != null){
+            log.info("{} has PosTag {}",token,tag.value());
+        } else {
+            log.info("{} has no PosTag",token);
+        }
+        //(2) use the key to retrieve values
+        String key = "urn:test-dummy";
+        Value<?> value = token.getValue(key);
+        //the programmer needs to know the type!
+        if(value != null && value.probability() > 0.5){
+            log.info("{}={}",key,value.value());
+        }
+    }
+    
+The _Annotated_ interface supports multi valued annotations. For that it defines methods for adding/setting and getting multiple values. Values are sorted first by the probability (unknown probability last) and secondly by the insert order (first in first out). So calling the single value getAnnotation() method on a multi valued field will return the first item (the one with the highest probability, or the first added in case of multiple items with the same/no probability).
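+
+The ordering of multiple values can be sketched with a stable sort. This is a minimal sketch and not the Stanbol code: the _Val_ class is a hypothetical stand-in for _Value_, and any negative probability is treated as "unknown".
+

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration only; not the Stanbol Value class.
class ValueOrderSketch {

    static final class Val {
        final String value;
        final double probability; // [0..1], or -1 if not known

        Val(String value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    // Higher probability first, unknown probability last.
    // Equal entries compare as 0, so a stable sort keeps their insert order.
    static int compare(Val a, Val b) {
        boolean aUnknown = a.probability < 0;
        boolean bUnknown = b.probability < 0;
        if (aUnknown || bUnknown) {
            return aUnknown == bUnknown ? 0 : (aUnknown ? 1 : -1);
        }
        return Double.compare(b.probability, a.probability);
    }

    static List<String> ordered(List<Val> insertOrder) {
        List<Val> sorted = new ArrayList<Val>(insertOrder);
        sorted.sort(ValueOrderSketch::compare); // List.sort is stable
        List<String> result = new ArrayList<String>();
        for (Val v : sorted) {
            result.add(v.value);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Val> values = Arrays.asList(
            new Val("a", -1), new Val("b", 0.5), new Val("c", 0.9),
            new Val("d", 0.5), new Val("e", -1));
        System.out.println(ordered(values)); // prints [c, b, d, a, e]
    }
}
```

+
+In this sketch the single value getter would correspond to taking the first element of the ordered list.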
\ No newline at end of file

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext?rev=1412870&view=auto
==============================================================================
--- stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext 
(added)
+++ stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/index.mdtext 
Fri Nov 23 13:11:23 2012
@@ -0,0 +1,12 @@
+title: NLP processing module
+
+The NLP processing module for the Stanbol Enhancer was introduced by [STANBOL-733](https://issues.apache.org/jira/browse/STANBOL-733) and is only available in the Stanbol Enhancer starting from version <code>0.10.0</code>.
+
+Its intention is to efficiently handle word level NLP processing annotations, as this kind of annotation would create too many RDF triples to handle them in the [metadata of the ContentItem](../contentitem#metadata-of-the-contentitem).
+
+The module consists of the following parts:
+
+* __[AnalysedText](analyzedtext)__: A data structure that represents a text as _Span_s like _Token_s, _Chunk_s, _Sentence_s, _TextSection_s and the _AnalysedText_ itself, which selects the text as a whole. In addition all spans can be annotated with additional information by using the _Annotated_ interface.
+* __[NLP Annotations](nlpannotations)__: The Stanbol NLP processing module defines ontology aligned annotation models for typical NLP processing results such as Part of Speech tagging, Phrase detection, Named Entity Recognition and full Morphological Analysis. These annotation models can then be stored with the different _Span_s defined in the _AnalysedText_.
+
+In addition the NLP processing module provides a default 
[in-memory](inmemoryanalyzedtextimpl) implementation of all defined interfaces 
that is sufficient for all current Stanbol use cases.
\ No newline at end of file

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/inmemoryanalyzedtextimpl.mdtext
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/inmemoryanalyzedtextimpl.mdtext?rev=1412870&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/inmemoryanalyzedtextimpl.mdtext
 (added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/inmemoryanalyzedtextimpl.mdtext
 Fri Nov 23 13:11:23 2012
@@ -0,0 +1,21 @@
+Title: In-Memory AnalyzedText and Annotation implementation
+
+This describes the implementation of the [Analyzed Text](analyzedtext) used by default by the Stanbol NLP processing module. This implementation is directly contained within the org.apache.stanbol.enhancer.nlp module.
+
+## AnalyzedTextFactory
+
+The AnalyzedTextFactory of the in-memory implementation registers itself as OSGi service with a "service.ranking" of Integer.MIN_VALUE. That means that any other registered AnalyzedTextFactory will override this one (unless it uses Integer.MIN_VALUE itself).
+
+The implementation uses the ContentItemHelper#getText(Blob blob) method to retrieve the text from the passed blob. The text is then used to create an AnalyzedText instance.
+
+## AnalyzedText Implementation
+
+The in-memory implementation is based on a NavigableMap that uses the same span as both key and value. TreeMap is currently used as implementation. The compareTo(..) method of the Span implementation ensures the correct ordering of Spans as specified by the [Analyzed Text](analyzedtext) interface. All add**(..) methods first check if a span with the added type and [start,end) is already contained. If this is the case the existing span is returned, otherwise a new instance is created.
+
+The Iterator implementation is not based on the Iterators provided by the NavigableMap as those would throw ConcurrentModificationExceptions - which is prohibited by the specification. Instead an implementation that is based on the #higherKey() method is used. Filtered Iterators are implemented using the Apache Commons Collections FilterIterator utility with a Predicate based on the SpanTypeEnum.
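+
+The idea of an Iterator based on #higherKey() can be sketched as follows. This is a simplified sketch and not the Stanbol source: it iterates over the keys of any NavigableMap, uses Integer keys in the demo instead of Spans, and leaves out the filtering by type.
+

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;

// Simplified sketch; class and method names are hypothetical.
class HigherKeyIteratorSketch {

    // Returns an Iterator that looks up the next key on every call and
    // therefore never throws a ConcurrentModificationException.
    static <K> Iterator<K> iterator(final NavigableMap<K, ?> map) {
        return new Iterator<K>() {
            private K last;          // the key returned by the last next() call
            private boolean started; // false until next() was called once

            private K peek() {
                if (!started) {
                    return map.isEmpty() ? null : map.firstKey();
                }
                return map.higherKey(last); // null if there is no higher key
            }

            @Override
            public boolean hasNext() {
                return peek() != null;
            }

            @Override
            public K next() {
                K next = peek();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                started = true;
                last = next;
                return next;
            }
        };
    }

    // Adds keys while iterating: 20 lies after the current key and is
    // visited, 5 lies before the current key and is not.
    static List<Integer> demo() {
        NavigableMap<Integer, Integer> map = new TreeMap<Integer, Integer>();
        map.put(10, 10);
        map.put(30, 30);
        List<Integer> visited = new ArrayList<Integer>();
        Iterator<Integer> it = iterator(map);
        while (it.hasNext()) {
            int key = it.next();
            visited.add(key);
            if (key == 10) {
                map.put(20, 20);
                map.put(5, 5);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [10, 20, 30]
    }
}
```

+
+Each step costs a tree lookup instead of following a pointer, which is the price paid for being able to add Spans during the iteration.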
+
+## Annotation Implementation
+
+The implementation of the _Annotated_ interface is similar to that of the SolrInputDocument. Internally it uses a Map<Object,Object> to store data. When a single value is added it is directly stored in the map. In case of multiple values the data are stored in Arrays. Arrays are sorted by a comparator that ensures that the value with the highest probability is at index '0'.
+
+Type safety is not checked so creating multiple Annotations with different 
value types that share the same key will cause ClassCastExceptions at runtime. 
\ No newline at end of file

Added: 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations
URL: 
http://svn.apache.org/viewvc/stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations?rev=1412870&view=auto
==============================================================================
--- 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations 
(added)
+++ 
stanbol/site/trunk/content/docs/trunk/components/enhancer/nlp/nlpannotations 
Fri Nov 23 13:11:23 2012
@@ -0,0 +1,222 @@
+title: NLP Annotations
+
+While the [Analyzed Text](analyzedtext) interface allows to define Sentences, Chunks and Tokens within the text and also to attach annotations to those, this part of the Stanbol NLP processing module defines the Java domain model used for those annotations. This includes annotation models for Part of Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as morphological analysis.
+
+### Part of Speech (POS) annotations
+
+Part of Speech (POS) tagging represents a token level annotation. It assigns tokens to categories like noun, verb, adjective, punctuation ... These annotations are typically provided by a POS tagger that consumes Tokens and provides tag(s) with confidence(s) as output. Tags are usually string values that are members of a TagSet - a fixed list of tags used to annotate tokens. Those tag sets are typically language specific and often even training corpus specific. This makes it really hard to consume POS tags created by different POS taggers for different languages, as the consumer would need to know the meaning of all the different POS tags for the different languages.
+
+The POS annotation model defined by the Stanbol NLP module tries to solve this 
issue by providing means to align POS tag sets with formal categories defined 
by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The following sub-section 
will provide details and usage examples.
+
+#### OLiA MorphosyntacticCategories
+
+The '[OLiA](http://nlp2rdf.lod2.eu/olia/) Reference Model for Morphology and Morphosyntax, with experimental extension to Syntax' defines a set of ~150 formally defined and multi-lingual POS tags. Those types are defined as a non-cyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as common root.
+
+To give an example, the POS 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', which itself is an 'olia:Verb'. An example for a multi-hierarchy is 'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.
+
+To integrate the formal definitions of the OLiA ontology with the Stanbol NLP annotations there are two Java enumerations:
+
+* __LexicalCategory__: This enumeration covers the 12 top level categories as defined by OLiA. This includes Noun, Verb, Adjective, Adposition, Adverb, Conjunction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, Residual and Unique.
+* __Pos__: This enumeration covers all OLiA MorphosyntacticCategories from the second level downwards. So by using the _Pos_ enum one can e.g. distinguish between ProperNouns and CommonNouns or FiniteVerbs and NonFiniteVerbs ... The _Pos_ enumeration has full support for the multi-hierarchy as defined by OLiA. The Pos#categories() method returns the first-level parents of a _Pos_ member. The Pos#hierarchy() method returns all parents of a _Pos_ member from the second level downwards.
+
+#### PosTag and TagSet
+
+The PosTag represents a POS tag as used by a POS tagger. PosTags support the following features:
+
+* __tag__ [1..1]::String - This is the string tag as used by the POS tagger.
+* __category__ [0..*]::LexicalCategory - The assigned LexicalCategory 
enumeration members.
+* __pos__ [0..*]::Pos - The assigned Pos enumeration members.
+
+An example for a PosTag representing an 'olia:ProperNoun' looks as follows:
+
+    :::java
+    PosTag tag = new PosTag("NP", Pos.ProperNoun);
+
+The first parameter is the String POS tag used by the POS tagger and the second parameter represents the mapping to the OLiA MorphosyntacticCategories for this tag. The next example shows a more sophisticated mapping for the "PWAV" (Pronominaladverb) as used by the STTS tag set for the German language:
+
+    :::java
+    new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun,
+        Pos.InterrogativePronoun);
+
+_TagSet_ is the other important class as it allows to manage the set of PosTag instances. _TagSet_ has two main functions: First it allows an integrator of a POS tagger with Stanbol to define the mappings from the string POS tags used by the POS tagger to the LexicalCategory and Pos enumeration members as preferably used by the Stanbol NLP chain. Second it ensures that there is only a single instance of PosTag used to annotate all Tokens with the same type.
+
+_TagSet_s are typically specified as static members of utility classes. The following code snippet shows an example:
+
+    :::java
+    //Tagset is generically typed. We need a TagSet for PosTag's
+    public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
+        "STTS", "de"); //define a name and the languages it supports
+
+    static {
+        //you can set properties to a TagSet. While supported this
+        //feature is currently not used by Stanbol
+        STTS.getProperties().put("olia.annotationModel",
+            new UriRef("http://purl.org/olia/stts.owl"));
+        STTS.getProperties().put("olia.linkingModel",
+            new UriRef("http://purl.org/olia/stts-link.rdf"));
+        STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
+        STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
+        STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
+        //[...]
+    }
+
+The string tag (first parameter) of the _PosTag_ is used as unique key by the _TagSet_. Adding a second _PosTag_ with the same tag will override the first one. _PosTag_s that are added to a _TagSet_ have the _Tag#getAnnotationModel()_ property set to that model.
+
+The final code snippet shows the core part of a POS tagging engine using both the [AnalyzedText](analyzedtext) and the _PosTag_ and _TagSet_ APIs.
+
+    :::java
+    TagSet<PosTag> tagSet; //the used TagSet
+    //holds PosTags for tags returned by the POS tagger that
+    //are missing in the TagSet
+    Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
+    List<Span> tokens = new ArrayList<Span>(64);
+
+    Iterator<Section> sentences; //Iterator over the sentences
+
+    while(sentences.hasNext()){
+        Section sentence = sentences.next();
+        //get the tokens of the current sentence
+        tokens.clear();
+        AnalysedTextUtils.appandToList(
+            sentence.getEnclosed(SpanTypeEnum.Token),
+            tokens);
+        //typically one needs also to get the Strings
+        //of the tokens for the pos tagger
+        String[] tokenText = new String[tokens.size()];
+        for(int i=0;i<tokens.size();i++){
+            tokenText[i] = tokens.get(i).getSpan();
+        }
+
+        //now POS tag the sentence
+        String[] posTags = posTagger.tag(tokenText);
+
+        //finally apply the PosTags and save the annotation
+        for(int i=0;i<tokens.size();i++){
+            PosTag tag = tagSet.get(posTags[i]);
+            if(tag == null) { //unmapped tag
+                tag = adhocTags.get(posTags[i]);
+            }
+            if(tag == null) { //unknown tag
+                tag = new PosTag(posTags[i]);
+                adhocTags.put(posTags[i],tag);
+            }
+            //add the annotation to the Token
+            tokens.get(i).addAnnotation(
+                NlpAnnotations.POS_ANNOTATION,
+                Value.value(tag));
+        }
+    }
+
+### Phrase annotations
+
+Phrase annotations can be used to define the type of a _Chunk_. The _PhraseTag_ class is used for phrase annotations. It defines first a string tag and secondly the phrase category. The _LexicalCategory_ enumeration is used as value for the category. As _PhraseTag_ is a subclass of _Tag_ it can also be used in combination with the _TagSet_ class as described in the [PosTag and TagSet] section.
+
+The following code snippet shows how to create a PhraseTag for noun phrases:
+
+    :::java
+    PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);
+
+  
+
+### Named Entity (NER) annotations
+
+Named Entity annotations are created by NER modules. Before the Stanbol NLP chain they were represented in Stanbol by using '[fise:TextAnnotation](../enhancementstructure#fisetextannotation)'s, and any Enhancement Engine that does NER should still support this. With the Stanbol NLP processing module it is now also possible to represent detected Named Entities as a _Chunk_ with a _NerTag_ added as Annotation.
+
+A Named Entity represented as 'fise:TextAnnotation' includes the following 
information:
+
+    urn:namedEntity:1
+        rdf:type fise:TextAnnotation, fise:Enhancement
+        fise:selected-text {named-entity-text}
+        fise:start {start-char-pos}
+        fise:end {end-char-pos}
+        dc:type {named-entity-type}
+
+where:
+
+* {named-entity-text} is the text recognized as Named Entity. This is the same as returned by _Chunk#getSpan()_
+* {start-char-pos} is the start character position of the Named Entity relative to the start of the text. This is the same as _Chunk#getStart()_
+* {end-char-pos} is the end position and the same as _Chunk#getEnd()_
+* {named-entity-type} is the type of the recognized Named Entity as URI. The _NerTag_ allows to define both the string tag as used by the NER component as well as the URI this type is mapped to. In Stanbol it is preferred to use 'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the according entity types.
+
+The _NerTag_ class extends _Tag_ and can therefore be also used with the 
_TagSet_ class. This means that users of the API can use _TagSet_ to manage the 
string tag to URI mappings for the supported Named Entity types.
+
+The following code snippet shows how to add NER annotations to the AnalysedText:
+
+    :::java
+    AnalysedText at; //The AnalysedText
+    TagSet<NerTag> nerTags; //registered NER tags
+    Iterator<Section> sections; //sections to iterate over
+
+    List<String> tokenTexts = new ArrayList<String>(64);
+
+    while(sections.hasNext()){
+        Section section = sections.next();
+        //NER taggers typically need a String[] as input
+        tokenTexts.clear();
+        Iterator<Token> tokens = section.getTokens();
+        while(tokens.hasNext()){
+            tokenTexts.add(tokens.next().getSpan());
+        }
+        //Span -> #start #end #type #probability
+        Span[] nerSpans = nerTagger.tag(
+            tokenTexts.toArray(new String[tokenTexts.size()]));
+        for(int i=0; i < nerSpans.length; i++){
+            Chunk namedEntity = at.addChunk(
+                nerSpans[i].start,nerSpans[i].end);
+            NerTag tag = nerTags.get(nerSpans[i].type);
+            if(tag == null){ //unmapped NER
+                tag = new NerTag(nerSpans[i].type);
+            }
+            namedEntity.addAnnotation(
+                NlpAnnotations.NER_ANNOTATION,
+                Value.value(tag, nerSpans[i].probability));
+        }
+    }
+   
+Note that the above code snippet only shows how to add the Named Entity to the AnalysedText ContentPart. An actual NER engine implementation also needs to add this information to the metadata of the [ContentItem](../contentitem).
+
+    :::java
+    ContentItem ci; //The processed ContentItem
+    Language language; //The Language of the processed Text
+    MGraph metadata = ci.getMetadata();
+    Section section; //the current Section
+    Chunk namedEntity; //the currently processed Named Entity
+
+    Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
+        NlpAnnotations.NER_ANNOTATION);
+
+    UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
+    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT,
+        new PlainLiteralImpl(namedEntity.getSpan(), language)));
+    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT,
+        new PlainLiteralImpl(section.getSpan(), language)));
+    if(nerAnnotation.value().getType() != null){
+        metadata.add(new TripleImpl(textAnnotation, DC_TYPE,
+            nerAnnotation.value().getType()));
+    } //else do not add a dc:type for unmapped NamedEntities
+    metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE,
+        literalFactory.createTypedLiteral(nerAnnotation.probability())));
+    metadata.add(new TripleImpl(textAnnotation, ENHANCER_START,
+        literalFactory.createTypedLiteral(namedEntity.getStart())));
+    metadata.add(new TripleImpl(textAnnotation, ENHANCER_END,
+        literalFactory.createTypedLiteral(namedEntity.getEnd())));
+
+
+### Morphological Analysis
+
+
+__NOTE:__ _This part of the Stanbol NLP annotations is still work in progress. 
So this part of the API might undergo heavy changes even in minor releases._
+
+
+The results of a morphological analysis are represented by the _MorphoFeatures_ class and can be added to the analyzed word (_Token_) by using the _NlpAnnotations.MORPHO_ANNOTATION_. The _MorphoFeatures_ class provides the following features:
+
+* __Lemma__: A String value representing the lemmatization of the annotated 
Token.
+* __Case__: The _Case_ enumeration contains around 70 members defined based on 
concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _CaseTag_ 
allows to define cases and optionally map them to the cases defined by the 
enumeration.
+* __Definitness__: The _Definitness_ enumeration has the members Definite and 
Indefinite also defined by Concepts in the [OLiA 
Ontology](http://nlp2rdf.lod2.eu/olia/).
+* __Gender__: The _Gender_ enumeration contains the six genders defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _GenderTag_ allows to define genders and optionally map them to the genders defined by the enumeration.
+* __Number__: The _NumberFeature_ enumeration defines the eight number features defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The _NumberTag_ can be used to define number features and map them to the members of the enumeration.
+* __Person__: The _Person_ enumeration has the definitions for 'first', 'second' and 'third' with mappings to the according concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/).
+* __Tense__: The _Tense_ enumeration represents the tense hierarchy as defined by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _Tense#getParent()_ method allows access to the direct parent of a _Tense_ while the _Tense#getTenses()_ method can be used to obtain the transitive closure (including the _Tense_ object itself). _TenseTag_ is used for Tense annotations. It allows both to pass a string tag representing the tense as well as to define a mapping to the tenses defined by the _Tense_ enumeration.
+* __Mood__: The _VerbMood_ enumeration currently defines members from different parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). OLiA does define the 'olia:MoodFeature' class, but its members were not a good match for the verb moods used by the CELI/linguagrid.org service. For now the decision was to define the _VerbMood_ enumeration more closely to the usage of CELI, but this clearly needs to be validated as soon as implementations for other NLP frameworks are added. There is also a _VerbMoodTag_ that allows to define verb moods by a string tag and a mapping to the _VerbMood_ enumeration.
+
+ 
+The _MorphoFeatures_ class supports multi valued annotations for all the above features. Getters for a single value will always return the first added value.
\ No newline at end of file
