[jira] [Commented] (STANBOL-734) ContentPart for NLP data - AnalyzedText

Rupert Westenthaler (JIRA) Mon, 19 Nov 2012 07:13:03 -0800

    [ 
https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500293#comment-13500293
 ]


Rupert Westenthaler commented on STANBOL-734:
---------------------------------------------

Documentation for Analyzed Text

AnalysedText
=====

The AnalysedText is a Java domain model designed to describe NLP processing 
results. It consists of two major parts:

1. Structure of the Text such as text-sections, sentences, chunks and tokens
2. Annotations for the detected parts of the text.


## AnalysedText as ContentPart

Within the Stanbol Enhancer the AnalysedText is used as 
[ContentPart](../contentitem#content-parts) registered with the URI 
<code>urn:stanbol.enhancer:nlp.analysedText</code>.

Because of that it can be retrieved by using the following code:

    :::java
    AnalysedText at;
    ci.getLock().readLock().lock();
    try {
        at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
    } catch (NoSuchPartException e) {
        //not present
        at = null;
    } finally {
        ci.getLock().readLock().unlock();
    }

Components that need to create an AnalysedText instance can do so by using the 
_AnalysedTextFactory_:

    :::java
    @Reference
    AnalysedTextFactory atf;

    ContentItem ci; //the contentItem
    AnalysedText at;
    Entry<UriRef,Blob> plainTextBlob = ContentItemHelper.getBlob(
        ci, Collections.singleton("text/plain"));
    if(plainTextBlob != null){
        //creates and adds the AnalysedText ContentPart to the ContentItem
        ci.getLock().writeLock().lock();
        try {
            at = atf.createAnalysedText(ci,plainTextBlob.getValue());
        } finally {
            ci.getLock().writeLock().unlock();
        }
    } else { //no NLP processing possible
        at = null;
    }

If used outside of an OSGi environment, users can also call 
AnalysedTextFactory#getDefaultInstance() to obtain the AnalysedTextFactory 
instance of the in-memory implementation.


## Structure of the Text

The basic building block of the AnalysedText is the Span. A Span defines a type, 
the character range [start,end) as well as the spanText. The type is a member 
of an enumeration (_SpanTypeEnum_): Text, TextSection, Sentence, Chunk or Token. 
[start,end) defines the character positions of the Span within the Text, where 
the start position is inclusive and the end position is exclusive.
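
The half-open [start,end) convention is the same one used by 
String#substring(start, end), so the text covered by a Span can be read 
directly from the analysed text. The following is only a minimal sketch using 
plain JDK strings; the class and method names are hypothetical and not part of 
the Stanbol API.

```java
// Hypothetical illustration; not part of the Stanbol API.
class HalfOpenOffsets {

    /** Returns the text covered by [start,end): start inclusive, end exclusive. */
    static String covered(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Stanbol rocks.";
        // [0,7) covers the first seven characters
        System.out.println(covered(text, 0, 7));
        // adjacent ranges such as [0,7) and [7,8) share a boundary without overlapping
        System.out.println("[" + covered(text, 7, 8) + "]");
        // the length of a range is simply end - start
        System.out.println(covered(text, 8, 13).length());
    }
}
```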

Analogous to the type of the Span there are also Java interfaces representing 
those types and providing additional convenience methods. An additional 
_Section_ interface was introduced as common parent for all types that may have 
enclosed Spans. The AnalysedText is the interface representing 
SpanTypeEnum#Text. The main intention of those Java interfaces is to provide 
convenience methods that ease the use of the API.

### Uniqueness of Spans

A Span is considered equal to another Span if [start, end) and type are the 
same. The natural order of Spans is defined by:

* smaller start index first
* bigger end index first
* higher ordinal number of the SpanTypeEnum first

This order is used by all Iterators returned by the AnalysedText API.
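
The three ordering rules can be sketched as a Comparator. This is not the 
Stanbol implementation: _SpanKey_ and _Type_ are hypothetical stand-ins, the 
ordinals are assumed to follow the order in which the types are listed in this 
document, and the rules are implemented literally as written above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the ordering rules; these are not the Stanbol classes.
class SpanOrderSketch {

    // Assumption: ordinals follow the order in which the types are listed in the text.
    enum Type { Text, TextSection, Sentence, Chunk, Token }

    static final class SpanKey {
        final Type type;
        final int start;
        final int end;

        SpanKey(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    static final Comparator<SpanKey> NATURAL_ORDER = (a, b) -> {
        if (a.start != b.start) {
            return Integer.compare(a.start, b.start); // smaller start index first
        }
        if (a.end != b.end) {
            return Integer.compare(b.end, a.end); // bigger end index first
        }
        return Integer.compare(b.type.ordinal(), a.type.ordinal()); // higher ordinal first
    };

    /** Sorts the given spans and returns their string forms in natural order. */
    static List<String> sorted(SpanKey... spans) {
        List<SpanKey> list = new ArrayList<>(Arrays.asList(spans));
        list.sort(NATURAL_ORDER);
        List<String> names = new ArrayList<>();
        for (SpanKey span : list) {
            names.add(span.toString());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(sorted(
            new SpanKey(Type.Token, 6, 10),
            new SpanKey(Type.Token, 0, 5),
            new SpanKey(Type.Sentence, 0, 20),
            new SpanKey(Type.Chunk, 0, 5),
            new SpanKey(Type.Text, 0, 40)));
    }
}
```

Because enclosing ranges (bigger end index) sort before the ranges they 
contain, iterating in this order visits a section before its content.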

### Concurrent Modifications and Iterators

Iterators returned by the AnalysedText API MUST NOT throw 
_ConcurrentModificationException_s but rather reflect changes to the 
underlying model. While this is not consistent with the default behavior of 
Iterators in Java, it is central to the effective usage of the AnalysedText 
API - e.g. when iterating over Sentences while adding Tokens.
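
One way to obtain such behavior, hinted at in the issue description, is to 
look up the successor lazily with NavigableSet#higher instead of using the 
set's own iterator. The following is a minimal sketch of that idea using a 
plain TreeSet of integers. _LiveIteratorSketch_ and _liveIterator_ are 
hypothetical names, and this is not claimed to be how the Stanbol 
implementation works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.NoSuchElementException;
import java.util.TreeSet;

// Hypothetical illustration; not the Stanbol implementation.
class LiveIteratorSketch {

    /** Iterator that resolves the next element on demand via NavigableSet#higher. */
    static <T> Iterator<T> liveIterator(final NavigableSet<T> set) {
        return new Iterator<T>() {
            private T last;                  // last element returned
            private boolean started = false; // false until next() was called once

            private T peek() {
                if (!started) {
                    return set.isEmpty() ? null : set.first();
                }
                return set.higher(last); // null if there is no greater element
            }

            @Override
            public boolean hasNext() {
                return peek() != null;
            }

            @Override
            public T next() {
                T next = peek();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                started = true;
                last = next;
                return next;
            }
        };
    }

    /** Adds elements while iterating and returns the elements that were visited. */
    static List<Integer> demo() {
        NavigableSet<Integer> offsets = new TreeSet<>(Arrays.asList(10, 20, 30));
        List<Integer> visited = new ArrayList<>();
        Iterator<Integer> it = liveIterator(offsets);
        while (it.hasNext()) {
            int current = it.next();
            visited.add(current);
            if (current == 10) {
                offsets.add(15); // ahead of the cursor: will be visited
                offsets.add(5);  // behind the cursor: will not be visited
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same loop written against TreeSet#iterator() would fail with a 
ConcurrentModificationException on the call to next() that follows the first 
add. The sketch is not thread safe; it only tolerates modifications made by 
the iterating thread itself.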

### Code Samples:

The following Code Snippet shows some typical usages of the API:

    :::java
    AnalysedText at; //typically retrieved from the contentPart
    Iterator<Sentence> sentences = at.getSentences();
    while(sentences.hasNext()){
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            String tokenText = token.getSpan();
            Value<PosTag> pos = token.getAnnotation(
                NlpAnnotations.POS_ANNOTATION);
            if(pos != null){ //the token may have no POS annotation
                String tag = pos.value().getTag();
                double confidence = pos.probability();
            }
        }
    }

Code that adds new Spans looks as follows:

    :::java
    //Tokenize a Text
    Iterator<Sentence> sentences = at.getSentences();
    Iterator<? extends Section> sections;
    if(sentences.hasNext()){ //sentence annotations present
        sections = sentences;
    } else { //if no sentences tokenize the text at once
        sections = Collections.singleton(at).iterator();
    }
    //Tokenize the sections
    while(sections.hasNext()){
        Section section = sections.next();
        //assuming the Tokenizer returns tokens as 2dim int array
        //with offsets relative to the text of the section
        int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
        for(int ti = 0; ti < tokenSpans.length; ti++){
            Token token = section.addToken(
                tokenSpans[ti][0],tokenSpans[ti][1]);
        }
    }

For all #add**(start,end) methods in the API the passed start and end indexes 
are relative to the parent (the one the #add**(..) method is called on). The 
[start,end) indexes returned by Spans are absolute values. If an #add**(..) 
method is called for a Span '[start,end):type' that already exists, then the 
already existing instance is returned instead of a new one.
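
The two rules above (relative indexes on add, one instance per 
'[start,end):type') can be sketched as follows. The _Span_ and _Section_ 
classes in this sketch are simplified, hypothetical stand-ins and not the 
Stanbol interfaces. The shared map is likewise only an assumption about how 
existing Spans could be looked up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; simplified stand-ins for the Stanbol interfaces.
class RelativeOffsetsSketch {

    static final class Span {
        final String type;
        final int start; // absolute, inclusive
        final int end;   // absolute, exclusive

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    static final class Section {
        private final Map<String, Span> spans; // registry shared by all sections of a text
        final int start; // absolute start of this section
        final int end;   // absolute end of this section

        Section(Map<String, Span> spans, int start, int end) {
            this.spans = spans;
            this.start = start;
            this.end = end;
        }

        /** Adds a Token; relStart and relEnd are relative to this section. */
        Span addToken(int relStart, int relEnd) {
            final int absStart = start + relStart;
            final int absEnd = start + relEnd;
            if (relStart < 0 || relStart > relEnd || absEnd > end) {
                throw new IllegalArgumentException("range outside of the section");
            }
            String key = "[" + absStart + "," + absEnd + "):Token";
            // return the existing instance if this span was already added
            return spans.computeIfAbsent(key, k -> new Span("Token", absStart, absEnd));
        }
    }

    public static void main(String[] args) {
        Map<String, Span> registry = new HashMap<>();
        Section sentence = new Section(registry, 10, 30);
        Span first = sentence.addToken(0, 4);
        Span again = sentence.addToken(0, 4);
        System.out.println(first.start + "," + first.end); // absolute offsets
        System.out.println(first == again);                // same instance
    }
}
```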


## Annotation Support

Annotation support is provided by the _Annotated_ interface and the 
_Annotation_ and _Value_ classes. _Annotated_ provides an API for adding 
information to the annotated object. Those annotations are represented by key 
value mappings where Object is used as key and the _Value_ class for values. 
The _Value_ class provides the generically typed value as well as a double 
probability in the range [0..1] or -1 if not known. Finally the _Annotation_ 
class is used to ensure type safety.

The following example shows the intended usage of the API

1. One needs to define the _Annotations_ one would like to use. Annotations are 
typically defined as public static members of interfaces or classes. The 
following example uses the definition of the Part of Speech annotation.

    :::java
    public interface NlpAnnotations {
        //a Part of Speech Annotation using a String key 
        //and the PosTag class as value
        Annotation<String,PosTag> POS_ANNOTATION = 
            new Annotation<String,PosTag>(
                "stanbol.enhancer.nlp.pos", PosTag.class);
        ... 
    }

2. Defined _Annotation_s are used to add information to an _Annotated_ instance 
(like a Span). For adding annotations the use of _Annotation_s is required to 
ensure type safety. The following code snippet shows how to add a PosTag with 
the probability 0.95. 

    :::java
    PosTag tag = new PosTag("N"); //a simple POS tag
    Token token; //the Token to add the tag to
    token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));

3. For consuming annotations there are two options: first by using the 
_Annotation_ object and second by directly using the key. While the second 
option is less convenient (as it does not provide type safety), it allows 
consuming annotations without the need to have the used _Annotation_ in the 
classpath. The following example shows both options:

    :::java
    Iterator<Token> tokens = sentence.getTokens();
    while(tokens.hasNext()){
        Token token = tokens.next();
        //(1) use the POS_ANNOTATION to get the PosTag
        Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
        if(tag != null){
            log.info("{} has PosTag {}",token,tag.value());
        } else {
            log.info("{} has no PosTag",token);
        }
        //(2) use the key to retrieve values
        String key = "urn:test-dummy";
        Value<?> value = token.getValue(key);
        //the programmer needs to know the type!
        if(value != null && value.probability() > 0.5){
            log.info("{}={}",key,value.value());
        }
    }
    
The _Annotated_ interface supports multi valued annotations. For that it 
defines methods for adding/setting and getting multiple values. Values are 
sorted first by the probability (unknown probability last) and secondly by the 
insert order (first in first out). So calling the single value getAnnotation() 
method on a multi valued field will return the first item (the one with the 
highest probability, or the first added one in case of multiple items with 
the same/no probability).
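
The ordering of multiple values can be sketched with a stable sort on the 
probability. The _Value_ class below is a simplified, hypothetical stand-in 
for the Stanbol class. It relies on the convention stated earlier that an 
unknown probability is represented as -1, which sorts after every known 
probability in [0..1] when ordering descending.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; simplified stand-in for the Stanbol Value class.
class ValueOrderSketch {

    static final double UNKNOWN = -1;

    static final class Value<T> {
        final T value;
        final double probability; // in [0..1], or UNKNOWN

        Value(T value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    /** Adds a value and keeps the list sorted by descending probability. */
    static <T> void add(List<Value<T>> values, Value<T> added) {
        values.add(added);
        // List#sort is stable: values with equal probability keep their insert order
        values.sort((a, b) -> Double.compare(b.probability, a.probability));
    }

    /** Builds a small multi valued annotation and returns the values in order. */
    static List<String> demo() {
        List<Value<String>> values = new ArrayList<>();
        add(values, new Value<>("NN", 0.6));
        add(values, new Value<>("VB", UNKNOWN));
        add(values, new Value<>("JJ", 0.9));
        add(values, new Value<>("RB", 0.6));
        List<String> tags = new ArrayList<>();
        for (Value<String> value : values) {
            tags.add(value.value);
        }
        return tags;
    }

    public static void main(String[] args) {
        // the first entry is what a single value getter would return
        System.out.println(demo());
    }
}
```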

                
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
>                 Key: STANBOL-734
>                 URL: https://issues.apache.org/jira/browse/STANBOL-734
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - that is usually available on a word 
> granularity - is not feasible using the RDF metadata, this describes the 
> addition of a special ContentPart to Stanbol. This ContentPart will have the 
> name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Token
> * Spans are sorted naturally by type, start and end. This allows using
> a NavigableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigableSet even allow building Iterators that tolerate concurrent
> modifications (e.g. adding Chunks while iterating over the Tokens of a
> Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that supports a type
> safe view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) the same instance (e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF to represent this
> kind of data, as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is some sample code:
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences();
>     while(sentences.hasNext()){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<Token> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.POS_ANNOTATION);
>             if(pos != null){ //the token may have no POS annotation
>                 String tag = pos.value().getTag();
>                 double confidence = pos.probability();
>             }
>         }
>     }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both also define an optional LexicalCategory. This is an enum
> with the 12 top level concepts defined by the
> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This also
> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> Olia. The other TagSets used by OpenNLP are currently not available from
> Olia.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagSet.getTag(tag);
>     if(pos == null){ //unknown tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.put(tag, pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POS_ANNOTATION,
>         new Value<PosTag>(pos, prob));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
