[jira] [Commented] (STANBOL-734) ContentPart for NLP data - AnalyzedText

Rupert Westenthaler (JIRA) Wed, 21 Nov 2012 00:52:05 -0800

    [ 
https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501777#comment-13501777
 ]


Rupert Westenthaler commented on STANBOL-734:
---------------------------------------------

Documentation for the NLP Annotations

NLP Annotations
===========

While the [Analyzed Text](analyzedtext) interface allows one to define 
Sentences, Chunks and Tokens within the text and to attach annotations to 
them, this part of the Stanbol NLP processing module defines the Java domain 
model used for those annotations. This includes annotation models for Part of 
Speech (POS) tags, Chunks, recognized Named Entities (NER) as well as 
morphological analysis.

### Part of Speech (POS) annotations

Part of Speech (POS) tagging is a token-level annotation. It assigns 
categories such as noun, verb, adjective or punctuation to tokens. These 
annotations are typically provided by a POS tagger that consumes Tokens and 
provides tag(s) with confidence(s) as output. Tags are usually string values 
that are members of a TagSet, a fixed list of tags used to annotate tokens. 
Tag sets are typically language specific and often even specific to the 
training corpus. This makes it hard to consume POS tags created by different 
POS taggers for different languages, as the consumer would need to know the 
meaning of every POS tag in every tag set.

The POS annotation model defined by the Stanbol NLP module tries to solve this 
issue by providing means to align POS tag sets with formal categories defined 
by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The following sub-section 
will provide details and usage examples.

#### OLiA MorphosyntacticCategories

The '[OLiA](http://nlp2rdf.lod2.eu/olia/) Reference Model for Morphology and 
Morphosyntax, with experimental extension to Syntax' defines a set of ~150 
formally defined, multi-lingual POS tags. Those types are defined as an 
acyclic multi-hierarchy with 'olia:MorphosyntacticCategory' as the common root.

To give an example, the POS 'olia:Gerund' is defined as an 'olia:NonFiniteVerb', 
which is itself an 'olia:Verb'. An example of the multi-hierarchy is 
'olia:NominalQuantifier', which is both an 'olia:Noun' and an 'olia:Quantifier'.

To integrate the formal definitions of the OLiA ontology with the Stanbol NLP 
annotations there are two Java enumerations:

* __LexicalCategory__: This enumeration covers the 12 top level categories as 
defined by OLiA: Noun, Verb, Adjective, Adposition, Adverb, 
Conjuction, Interjection, PronounOrDeterminer, Punctuation, Quantifier, 
Residual and Unique.
* __Pos__: This enumeration covers all OLiA MorphosyntacticCategories from the 
second level downwards. By using the _Pos_ enum one can e.g. distinguish between 
ProperNouns and CommonNouns or FiniteVerbs and NonFiniteVerbs. The _Pos_ 
enumeration has full support for the multi-hierarchy as defined by OLiA. The 
Pos#categories() method returns the first-level parents of a _Pos_ member. The 
Pos#hierarchy() method returns all its parents from the second level downwards.
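
The relationship between the two enumerations can be illustrated with a small, 
self-contained sketch. The `DemoCategory` and `DemoPos` types below are 
hypothetical stand-ins that cover only the three OLiA examples mentioned above. 
They are not the Stanbol enumerations, and the way the parents are collected is 
an assumption about how such a multi-hierarchy could be implemented.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.EnumSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for the top level categories (not the Stanbol enum). */
enum DemoCategory { Noun, Verb, Quantifier }

/** Hypothetical stand-in for the categories below the top level. */
enum DemoPos {
    NonFiniteVerb(EnumSet.of(DemoCategory.Verb)),
    Gerund(EnumSet.of(DemoCategory.Verb), NonFiniteVerb),
    // multi-hierarchy: two top level parents at once
    NominalQuantifier(EnumSet.of(DemoCategory.Noun, DemoCategory.Quantifier));

    private final Set<DemoCategory> categories;
    private final Set<DemoPos> parents;

    DemoPos(Set<DemoCategory> categories, DemoPos... parents) {
        this.categories = categories;
        this.parents = new LinkedHashSet<>(Arrays.asList(parents));
    }

    /** The first-level parents of this member. */
    Set<DemoCategory> categories() {
        return categories;
    }

    /** All ancestors below the first level (transitive, excluding this member). */
    Set<DemoPos> hierarchy() {
        Set<DemoPos> result = new LinkedHashSet<>();
        Deque<DemoPos> todo = new ArrayDeque<>(parents);
        while (!todo.isEmpty()) {
            DemoPos parent = todo.pop();
            if (result.add(parent)) {
                todo.addAll(parent.parents);
            }
        }
        return result;
    }
}

class PosHierarchyDemo {
    public static void main(String[] args) {
        System.out.println(DemoPos.Gerund.hierarchy());
        System.out.println(DemoPos.NominalQuantifier.categories());
    }
}
```

A consumer that only cares about verbs can then test `categories()` and ignore 
the finer distinctions, while a consumer interested in gerunds can use the 
more specific member directly.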

#### PosTag and TagSet 

The PosTag represents a POS tag as used by a POS tagger. PosTags support 
the following features:

* __tag__ [1..1]::String - This is the string tag as used by the POS tagger. 
* __category__ [0..*]::LexicalCategory - The assigned LexicalCategory 
enumeration members.
* __pos__ [0..*]::Pos - The assigned Pos enumeration members.

An example of a PosTag representing an 'olia:ProperNoun' looks as follows:

    :::java
    PosTag tag = new PosTag("NP", Pos.ProperNoun);

The first parameter is the String POS tag used by the POS tagger and the second 
parameter represents the mapping to the OLiA MorphosyntacticCategories for this 
tag. The next example shows a more sophisticated mapping for the "PWAV" tag 
(adverbial interrogative or relative pronoun) as used by the STTS tag set for 
the German language:

    :::java
    new PosTag("PWAV", LexicalCategory.Adverb, Pos.RelativePronoun, 
        Pos.InterrogativePronoun);

_TagSet_ is the other important class, as it manages the set of PosTag 
instances. _TagSet_ has two main functions: First, it allows an integrator of a 
POS tagger with Stanbol to define the mappings from the string POS tags used by 
the POS tagger to the LexicalCategory and Pos enumeration members that are 
preferably used by the Stanbol NLP chain. Second, it ensures that only a single 
PosTag instance is used to annotate all Tokens with the same tag.

_TagSet_s are typically specified as static members of utility classes. The 
following code snippet shows an example:

    :::java
    //Tagset is generically typed. We need a TagSet for PosTag's
    public static final TagSet<PosTag> STTS = new TagSet<PosTag>(
        "STTS", "de"); //define a name and the languages it supports

    static {
        //you can set properties to a TagSet. While supported this
        //feature is currently not used by Stanbol
        STTS.getProperties().put("olia.annotationModel", 
            new UriRef("http://purl.org/olia/stts.owl"));
        STTS.getProperties().put("olia.linkingModel", 
            new UriRef("http://purl.org/olia/stts-link.rdf"));
        STTS.addTag(new PosTag("ADJA", Pos.AttributiveAdjective));
        STTS.addTag(new PosTag("ADJD", Pos.PredicativeAdjective));
        STTS.addTag(new PosTag("ADV", LexicalCategory.Adverb));
        //[...]
    }

The string tag (first parameter) of the _PosTag_ is used as unique key by the 
_TagSet_. Adding a second _PosTag_ with the same tag will override the first one. 
_PosTag_s that are added to a _TagSet_ have the _Tag#getAnnotationModel()_ 
property set to that _TagSet_.
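
The keying behaviour described above can be sketched with a plain map. 
`DemoTag` and `DemoTagSet` are hypothetical, heavily simplified stand-ins 
(a single optional category held as a String); they only illustrate the 
"string tag as unique key" and "one shared instance per tag" ideas and do not 
reproduce the Stanbol classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified tag: the string tag plus an optional category. */
final class DemoTag {
    final String tag;
    final String category; // e.g. "Noun"; null for an unmapped tag

    DemoTag(String tag, String category) {
        this.tag = tag;
        this.category = category;
    }
}

/** Hypothetical, simplified tag set keyed by the string tag. */
final class DemoTagSet {
    private final Map<String, DemoTag> tags = new HashMap<>();

    /** Registers a tag; a tag with the same string key replaces the earlier one. */
    void addTag(DemoTag tag) {
        tags.put(tag.tag, tag);
    }

    /** Returns the single registered instance, or null if the tag is not mapped. */
    DemoTag getTag(String tag) {
        return tags.get(tag);
    }
}

class TagSetDemo {
    public static void main(String[] args) {
        DemoTagSet set = new DemoTagSet();
        set.addTag(new DemoTag("ADV", "Adverb"));
        // every lookup returns the same instance, so all tokens share it
        System.out.println(set.getTag("ADV") == set.getTag("ADV"));
    }
}
```

Because lookups always return the registered instance, annotating many tokens 
with the same tag does not create additional tag objects.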

The final example is a code snippet showing the core part of a POS tagging 
engine that uses the [AnalyzedText](analyzedtext) as well as the _PosTag_ and 
_TagSet_ APIs. 

    :::java
    TagSet<PosTag> tagSet; //the used TagSet
    //holds PosTags for tags returned by the POS tagger that
    //are missing in the TagSet
    Map<String,PosTag> adhocTags = new HashMap<String,PosTag>();
    List<Span> tokens = new ArrayList<Span>(64);

    Iterator<Section> sentences; //Iterator over the sentences

    while(sentences.hasNext()){
        Section sentence = sentences.next();
        //get the tokens of the current sentence
        tokens.clear();
        AnalysedTextUtils.appandToList(
            sentence.getEnclosed(SpanTypeEnum.Token),
            tokens);
        //typically one needs also to get the Strings
        //of the tokens for the pos tagger
        String[] tokenText = new String[tokens.size()];
        for(int i=0;i<tokens.size();i++){
            tokenText[i] = tokens.get(i).getSpan();
        }

        //now POS tag the sentence
        String[] posTags = posTagger.tag(tokenText);

        //finally apply the PosTags and save the annotation
        for(int i=0;i<tokens.size();i++){
            PosTag tag = tagSet.get(posTags[i]);
            if(tag == null) { //unmapped tag
                tag = adhocTags.get(posTags[i]);
            }
            if(tag == null) { //unknown tag
                tag = new PosTag(posTags[i]);
                adhocTags.put(posTags[i],tag);
            }
            //add the annotation to the i-th Token
            tokens.get(i).addAnnotation(
                NlpAnnotations.POS_ANNOTATION,
                Value.value(tag));
        }
    }
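
On the consuming side, the category mapping is what makes POS annotations 
usable across languages. The following sketch is an illustration under 
simplifying assumptions: `DemoLexCat` and `NounFilter` are hypothetical names, 
and the mapping is a plain `Map` from string tag to category instead of a 
Stanbol _TagSet_. The tag strings in the usage example ("NNP" and "VBZ" from 
the Penn Treebank tag set, "NN" and "VVFIN" from STTS) are real tags, but the 
two tiny mapping tables are made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for two of the top level lexical categories. */
enum DemoLexCat { Noun, Verb }

final class NounFilter {
    /**
     * Returns the tokens whose tag is mapped to Noun. The caller never
     * interprets the tagger specific tag strings itself; tags missing
     * from the mapping are skipped.
     */
    static List<String> nouns(String[] tokens, String[] tags,
                              Map<String, DemoLexCat> mapping) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (mapping.get(tags[i]) == DemoLexCat.Noun) {
                result.add(tokens[i]);
            }
        }
        return result;
    }
}

class NounFilterDemo {
    public static void main(String[] args) {
        Map<String, DemoLexCat> penn = new HashMap<>();
        penn.put("NNP", DemoLexCat.Noun);
        penn.put("VBZ", DemoLexCat.Verb);
        Map<String, DemoLexCat> stts = new HashMap<>();
        stts.put("NN", DemoLexCat.Noun);
        stts.put("VVFIN", DemoLexCat.Verb);

        // the same consumer code handles English and German input
        System.out.println(NounFilter.nouns(
            new String[]{"Paris", "sleeps"}, new String[]{"NNP", "VBZ"}, penn));
        System.out.println(NounFilter.nouns(
            new String[]{"Haus", "steht"}, new String[]{"NN", "VVFIN"}, stts));
    }
}
```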

### Phrase annotations

Phrase annotations can be used to define the type of a _Chunk_. The _PhraseTag_ 
class is used for phrase annotations. It defines a string tag and the phrase 
category. The _LexicalCategory_ enumeration is used as the value for the 
category. As _PhraseTag_ is a subclass of _Tag_ it can also be used in 
combination with the _TagSet_ class as described in the [PosTag and TagSet] 
section.

The following code snippet shows how to create a PhraseTag for noun phrases:

    :::java
    PhraseTag tag = new PhraseTag("NP", LexicalCategory.Noun);

  

### Named Entity (NER) annotations

Named Entity annotations are created by NER modules. Before the Stanbol NLP 
chain they were represented in Stanbol by using 
'[fise:TextAnnotation](../enhancementstructure#fisetextannotation)'s and any 
Enhancement Engine that does NER should still support this. With the Stanbol 
NLP processing module it is now also possible to represent detected Named 
Entities as a _Chunk_ with a _NerTag_ added as annotation.

A Named Entity represented as 'fise:TextAnnotation' includes the following 
information:

    urn:namedEntity:1 
        rdf:type fise:TextAnnotation, fise:Enhancement
        fise:selected-text {named-entity-text}
        fise:start {start-char-pos}
        fise:end {end-char-pos}
        dc:type {named-entity-type}

where:

* {named-entity-text} is the text recognized as Named Entity. This is the same 
as returned by _Chunk#getSpan()_
* {start-char-pos} is the start character position of the Named Entity relative 
to the start of the text. This is the same as _Chunk#getStart()_
* {end-char-pos} is the end position and the same as _Chunk#getEnd()_
* {named-entity-type} is the type of the recognized Named Entity as URI. The 
_NerTag_ allows one to define both the string tag as used by the NER component 
and the URI this type is mapped to. In Stanbol it is preferred to use 
'dbpedia:Person', 'dbpedia:Organisation' and 'dbpedia:Place' for the 
corresponding entity types.

The _NerTag_ class extends _Tag_ and can therefore also be used with the 
_TagSet_ class. This means that users of the API can use _TagSet_ to manage the 
mappings from string tags to URIs for the supported Named Entity types.
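
The mapping from the string tags of an NER component to type URIs can be 
sketched as follows. `DemoNerTypes` is a hypothetical helper built on 
`java.net.URI`, not part of the Stanbol API, and the tag strings "person" and 
"location" in the usage example are assumptions about what an NER component 
might emit. The target URIs use the DBpedia ontology namespace that the 
'dbpedia:' prefix above refers to.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper that maps NER string tags to entity type URIs. */
final class DemoNerTypes {
    static final String DBPEDIA_ONTOLOGY = "http://dbpedia.org/ontology/";

    private final Map<String, URI> types = new HashMap<>();

    /** Maps a tag of the NER component to a class of the DBpedia ontology. */
    void map(String nerTag, String localName) {
        types.put(nerTag, URI.create(DBPEDIA_ONTOLOGY + localName));
    }

    /** Returns the mapped type, or null if the NER tag is unmapped. */
    URI typeOf(String nerTag) {
        return types.get(nerTag);
    }
}

class NerTypesDemo {
    public static void main(String[] args) {
        DemoNerTypes types = new DemoNerTypes();
        types.map("person", "Person");
        types.map("location", "Place");
        System.out.println(types.typeOf("location"));
        // unmapped tags yield null; a caller would then omit the type
        System.out.println(types.typeOf("misc"));
    }
}
```

Returning null for unmapped tags keeps the decision of whether to emit a type 
with the caller, rather than inventing a URI for an unknown entity type.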

The following code snippet shows how to add NER annotations to the 
AnalysedText:

    :::java
    AnalysedText at; //The AnalysedText
    TagSet<NerTag> nerTags; //registered NER tags
    Iterator<Section> sections; //sections to iterate over

    List<String> tokenTexts = new ArrayList<String>(64);

    while(sections.hasNext()){
        Section section = sections.next();
        //NER taggers typically need a String[] as input
        tokenTexts.clear();
        Iterator<Token> tokens = section.getTokens();
        while(tokens.hasNext()){
            tokenTexts.add(tokens.next().getSpan());
        }
        //Span (of the NER tagger) -> #start #end #type #probability
        Span[] nerSpans = nerTagger.tag(
            tokenTexts.toArray(new String[tokenTexts.size()]));
        for(int i=0; i < nerSpans.length; i++){
            Chunk namedEntity = at.addChunk(
                nerSpans[i].start,nerSpans[i].end);
            NerTag tag = nerTags.get(nerSpans[i].type);
            if(tag == null){ //unmapped NER
                tag = new NerTag(nerSpans[i].type);
            }
            namedEntity.addAnnotation(
                NlpAnnotations.NER_ANNOTATION,
                Value.value(tag, nerSpans[i].probability));
        }
    }
   
Note that the above code snippet only shows how to add the Named Entity to the 
AnalyzedText ContentPart. An actual NER engine implementation also needs to add 
this information to the metadata of the [ContentItem](../contentitem).

    :::java
    ContentItem ci; //The processed ContentItem
    Language language; //The Language of the processed Text
    MGraph metadata = ci.getMetadata();
    Section section; //the current Section
    Chunk namedEntity; //the currently processed Named Entity
    LiteralFactory literalFactory; //used to create typed literals

    Value<NerTag> nerAnnotation = namedEntity.getAnnotation(
        NlpAnnotations.NER_ANNOTATION);

    UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(
        ci, this);
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTED_TEXT, 
        new PlainLiteralImpl(namedEntity.getSpan(), language)));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_SELECTION_CONTEXT, 
        new PlainLiteralImpl(section.getSpan(), language)));
    if(nerAnnotation.value().getType() != null){
        metadata.add(new TripleImpl(textAnnotation, DC_TYPE, 
            nerAnnotation.value().getType()));
    } //else do not add a dc:type for unmapped Named Entities
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_CONFIDENCE, 
        literalFactory.createTypedLiteral(nerAnnotation.probability())));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_START, 
        literalFactory.createTypedLiteral(namedEntity.getStart())));
    metadata.add(new TripleImpl(textAnnotation, ENHANCER_END, 
        literalFactory.createTypedLiteral(namedEntity.getEnd())));


### Morphological Analyses


__NOTE:__ _This part of the Stanbol NLP annotations is still work in progress. 
So this part of the API might undergo heavy changes even in minor releases._


The results of a morphological analysis are represented by the _MorphoFeatures_ 
class and can be added to the analyzed word (_Token_) by using the 
_NlpAnnotations.MORPHO_ANNOTATION_. The _MorphoFeatures_ class provides the 
following features: 

* __Lemma__: A String value representing the lemmatization of the annotated 
Token.
* __Case__: The _Case_ enumeration contains around 70 members defined based on 
concepts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _CaseTag_ 
allows one to define cases and optionally map them to the cases defined by the 
enumeration.
* __Definitness__: The _Definitness_ enumeration has the members Definite and 
Indefinite, which are also defined by concepts in the [OLiA 
Ontology](http://nlp2rdf.lod2.eu/olia/).
* __Gender__: The _Gender_ enumeration contains the six genders defined by the 
[OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _GenderTag_ allows one to 
define genders and optionally map them to the genders defined by the enumeration. 
* __Number__: The _NumberFeature_ enumeration defines the eight number features 
defined by [OLiA](http://nlp2rdf.lod2.eu/olia/). The _NumberTag_ can be used to 
define number features and map them to the members of the enumeration.
* __Person__: The _Person_ enumeration has the definitions for 'first', 
'second' and 'third' with mappings to the corresponding concepts of the [OLiA 
Ontology](http://nlp2rdf.lod2.eu/olia/).
* __Tense__: The _Tense_ enumeration represents the tense hierarchy as defined 
by the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). The _Tense#getParent()_ 
method gives access to the direct parent of a _Tense_, while the 
_Tense#getTenses()_ method can be used to obtain the transitive closure 
(including the _Tense_ object itself). _TenseTag_ is used for Tense 
annotations. It allows one both to specify a string tag representing the tense 
and to define a mapping to the tenses defined by the _Tense_ enumeration.
* __Mood__: The _VerbMood_ enumeration currently defines members from different 
parts of the [OLiA Ontology](http://nlp2rdf.lod2.eu/olia/). OLiA does define 
the 'olia:MoodFeature' class, but its members were not a good match for the 
verb moods as used by the CELI/linguagrid.org service. For now the decision was 
to define the _VerbMood_ enumeration more closely to the usage of CELI, but 
this clearly needs to be validated as soon as implementations for other NLP 
frameworks are added. There is also a _VerbMoodTag_ that allows one to define 
verb moods by a string tag and a mapping to the _VerbMood_ enumeration.

 
The _MorphoFeatures_ class supports multi-valued annotations for all the above 
features. Getters for a single value always return the first added value.
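
The "first added value" rule can be sketched with an insertion-ordered list. 
`DemoFeatures` is a hypothetical class that handles a single feature (gender, 
held as a String); the method names are illustrative and not taken from 
_MorphoFeatures_.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical multi-valued feature holder for one feature (gender). */
final class DemoFeatures {
    private final List<String> genders = new ArrayList<>();

    /** Adds a further value; earlier values are kept. */
    void addGender(String gender) {
        genders.add(gender);
    }

    /** Single-value getter: the first value added, or null if there is none. */
    String getGender() {
        return genders.isEmpty() ? null : genders.get(0);
    }

    /** Multi-value getter: all values in the order they were added. */
    List<String> getGenderList() {
        return Collections.unmodifiableList(genders);
    }
}

class FeaturesDemo {
    public static void main(String[] args) {
        DemoFeatures features = new DemoFeatures();
        features.addGender("Feminine");
        features.addGender("Neuter");
        System.out.println(features.getGender());
        System.out.println(features.getGenderList());
    }
}
```

With this convention a component that adds its most likely reading first lets 
simple consumers use the single-value getter, while others can still inspect 
all alternatives.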
                
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
>                 Key: STANBOL-734
>                 URL: https://issues.apache.org/jira/browse/STANBOL-734
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - that is usually available on a word 
> granularity - is not feasible using the RDF metadata this describes the 
> addition of a special ContentPart to Stanbol. This ContentPart will have the 
> name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Token
> * Spans are sorted naturally by type, start and end. This allows to
> use a NavigableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigableSet even allow to build Iterators that allow concurrent
> modifications (e.g adding Chunks while iterating over the Tokens of a
> Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that support a type
> safe view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) to use the same instance ( e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF for representing this
> kind of data, as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is a sample code
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences();
>     while(sentences.hasNext()){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<SentenceToken> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.posAnnotation);
>             String tag = pos.value().getTag();
>             double confidence = pos.probability();
>         }
>     }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both define also an optional LexicalCategory. This is a enum
> with the 12 top level concepts defined by the
> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This includes
> also the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLIA. The other TagSets used by OpenNLP are currently not available by
> Olia.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagSet.getTag(tag);
>     if(pos == null){ //unknown tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.put(tag, pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POSAnnotation,
>         new Value<PosTag>(pos, prob));
