[jira] [Commented] (STANBOL-734) ContentPart for NLP data - AnalyzedText

Rupert Westenthaler (JIRA) Mon, 19 Nov 2012 21:46:10 -0800

    [ 
https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500846#comment-13500846
 ]


Rupert Westenthaler commented on STANBOL-734:
---------------------------------------------

Documentation for the in-memory implementation of the AnalyzedText

In-Memory AnalyzedText and Annotation implementation
================

This describes the in-memory implementation of the 
[Analyzed Text](analysedtext) interface, which the Stanbol NLP processing 
module uses by default. The implementation is contained directly in the 
org.apache.stanbol.enhancer.nlp module.

## AnalyzedTextFactory

The AnalyzedTextFactory of the in-memory implementation registers itself as 
an OSGi service with a "service.ranking" of Integer.MIN_VALUE. This means that 
any other registered AnalyzedTextFactory will override this one (unless it 
also uses Integer.MIN_VALUE itself).
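
The effect of the ranking can be illustrated without any OSGi dependency. The following is only a plain-Java model of "highest ranking wins"; `RankedFactoryLookup`, `Registration` and the factory names are hypothetical and are not part of Stanbol or the OSGi API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java model of ranking-based service selection (no OSGi API involved). */
class RankedFactoryLookup {

    /** Hypothetical stand-in for a registered AnalyzedTextFactory service. */
    static final class Registration {
        final String name;
        final int ranking; // models the "service.ranking" property

        Registration(String name, int ranking) {
            this.name = name;
            this.ranking = ranking;
        }
    }

    private final List<Registration> registrations = new ArrayList<>();

    void register(String name, int ranking) {
        registrations.add(new Registration(name, ranking));
    }

    /** Returns the name of the registration with the highest ranking, or null if none. */
    String lookup() {
        return registrations.stream()
                .max(Comparator.comparingInt((Registration r) -> r.ranking))
                .map(r -> r.name)
                .orElse(null);
    }
}
```

With only the in-memory factory registered at Integer.MIN_VALUE, `lookup()` returns it. As soon as another factory is registered with any higher ranking (for example 0), that one is returned instead.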

The implementation uses the ContentItemHelper#getText(Blob blob) method to 
retrieve the text from the passed blob. The text is then used to create an 
AnalyzedText instance.

## AnalyzedText Implementation

The in-memory implementation is based on a NavigableMap that uses the same span 
as both key and value. TreeMap is currently used as the implementation. The 
compareTo(..) method of the Span implementation ensures the correct ordering of 
Spans as specified by the [Analyzed Text](analyzedtext) interface. All 
add**(..) methods first check whether a span with the same type and [start,end) 
range is already contained. If so, the existing span is returned; otherwise 
a new instance is created.
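
A minimal sketch of this idea follows. It is not the Stanbol code: the class names, the reduced `Type` enum and the concrete ordering (start ascending, longer spans first, then type) are assumptions made for illustration, since the ordering is defined by the Analyzed Text interface and not repeated here.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

/** Sketch of a span registry backed by a NavigableMap that maps each span to itself. */
class SpanRegistrySketch {

    /** Hypothetical, reduced set of span types. */
    enum Type { Sentence, Chunk, Token }

    static final class Span implements Comparable<Span> {
        final Type type;
        final int start; // inclusive
        final int end;   // exclusive

        Span(Type type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        // Assumed ordering: start ascending, then longer spans first, then type.
        @Override
        public int compareTo(Span other) {
            int c = Integer.compare(start, other.start);
            if (c == 0) {
                c = Integer.compare(other.end, end);
            }
            if (c == 0) {
                c = type.compareTo(other.type);
            }
            return c;
        }
    }

    // The same instance is used as key and value, so a lookup by an
    // equal (type, start, end) triple returns the registered instance.
    private final NavigableMap<Span, Span> spans = new TreeMap<>();

    /** Returns the registered span for (type, start, end), registering it if absent. */
    Span add(Type type, int start, int end) {
        Span candidate = new Span(type, start, end);
        Span existing = spans.get(candidate); // TreeMap compares via compareTo
        if (existing != null) {
            return existing;
        }
        spans.put(candidate, candidate);
        return candidate;
    }

    int size() {
        return spans.size();
    }
}
```

Calling `add(Type.Token, 0, 5)` twice returns the same instance, while `add(Type.Sentence, 0, 5)` registers a second span because the type differs.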

The Iterator implementation is not based on the Iterators provided by the 
NavigableMap, as those would throw ConcurrentModificationExceptions, which is 
prohibited by the specification. Instead, an implementation based on the 
#higherKey() method is used. Filtered Iterators are implemented using the 
Apache Commons Collections FilterIterator utility with a Predicate based on the 
SpanTypeEnum.
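
The #higherKey() approach can be sketched as follows. This is an illustration under assumptions, not the Stanbol implementation: `HigherKeyIterator` and `HigherKeyIteratorDemo` are hypothetical names, Integer keys stand in for Spans, and `java.util.function.Predicate` is used in place of the Commons Collections classes to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableMap;
import java.util.NoSuchElementException;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Iterates the keys of a NavigableMap and tolerates insertions during iteration. */
class HigherKeyIterator<K> implements Iterator<K> {

    private final NavigableMap<K, ?> map;
    private final Predicate<? super K> filter;
    private K last; // last key returned, null before the first call to next()

    HigherKeyIterator(NavigableMap<K, ?> map, Predicate<? super K> filter) {
        this.map = map;
        this.filter = filter;
    }

    /** Looks up the next accepted key against the current state of the map. */
    private K peek() {
        K key;
        if (last != null) {
            key = map.higherKey(last);
        } else {
            key = map.isEmpty() ? null : map.firstKey();
        }
        while (key != null && !filter.test(key)) {
            key = map.higherKey(key);
        }
        return key;
    }

    @Override
    public boolean hasNext() {
        return peek() != null;
    }

    @Override
    public K next() {
        K key = peek();
        if (key == null) {
            throw new NoSuchElementException();
        }
        last = key;
        return key;
    }
}

class HigherKeyIteratorDemo {

    /** Iterates the keys 1, 3, 5 and inserts 4 while positioned at 3. */
    static List<Integer> collectWhileInserting(Predicate<Integer> filter) {
        NavigableMap<Integer, Integer> map = new TreeMap<>();
        for (int k : new int[] {1, 3, 5}) {
            map.put(k, k);
        }
        List<Integer> seen = new ArrayList<>();
        Iterator<Integer> it = new HigherKeyIterator<Integer>(map, filter);
        while (it.hasNext()) {
            int k = it.next();
            seen.add(k);
            if (k == 3) {
                map.put(4, 4); // a TreeMap key iterator would fail on its next() call
            }
        }
        return seen;
    }
}
```

Because every step is a fresh lookup relative to the last returned key, elements added behind the current position are simply picked up. The price is a map lookup per step instead of following the tree's internal links.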

## Annotation Implementation

The implementation of the _Annotated_ interface is similar to that of the 
SolrInputDocument. Internally it uses a Map<Object,Object> to store data. When 
a single value is added it is stored directly in the map. Multiple values are 
stored in arrays. Arrays are sorted by a comparator that ensures that the 
value with the highest probability is at index '0'.
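
The storage scheme could look roughly like the sketch below. All names (`AnnotationStoreSketch`, its `Value` class, `isStoredAsArray`) are hypothetical and only illustrate the single-value versus array handling described above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a Map<Object,Object> store holding either one value or a sorted array. */
class AnnotationStoreSketch {

    /** Hypothetical holder pairing a value with its probability. */
    static final class Value {
        final Object value;
        final double probability;

        Value(Object value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    // Highest probability first, so the best value ends up at index 0.
    private static final Comparator<Value> BY_PROBABILITY_DESC =
            Comparator.comparingDouble((Value v) -> v.probability).reversed();

    private final Map<Object, Object> data = new HashMap<>();

    void add(Object key, Value value) {
        Object current = data.get(key);
        if (current == null) {
            data.put(key, value); // single value: stored directly
            return;
        }
        Value[] merged;
        if (current instanceof Value[]) {
            Value[] old = (Value[]) current;
            merged = Arrays.copyOf(old, old.length + 1);
        } else {
            merged = new Value[] {(Value) current, null};
        }
        merged[merged.length - 1] = value;
        Arrays.sort(merged, BY_PROBABILITY_DESC);
        data.put(key, merged); // multiple values: stored as a sorted array
    }

    /** Returns the value with the highest probability, or null for an unknown key. */
    Value get(Object key) {
        Object current = data.get(key);
        if (current instanceof Value[]) {
            return ((Value[]) current)[0];
        }
        return (Value) current;
    }

    boolean isStoredAsArray(Object key) {
        return data.get(key) instanceof Value[];
    }
}
```

Storing the common single-value case without a wrapping array avoids one allocation per annotation, which matters when every token of a text carries annotations.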

Type safety is not checked, so creating multiple Annotations with different 
value types that share the same key will cause ClassCastExceptions at runtime.
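
The failure mode can be reproduced with a few lines. This is a simplified sketch: `UncheckedAnnotationSketch` and its `Annotation<V>` class are hypothetical and only mimic a generically typed view on top of an untyped map.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch showing how an unchecked typed view over an untyped map can fail. */
class UncheckedAnnotationSketch {

    /** Hypothetical typed view on a plain Object key. */
    static final class Annotation<V> {
        final Object key;

        Annotation(Object key) {
            this.key = key;
        }
    }

    private final Map<Object, Object> data = new HashMap<>();

    <V> void set(Annotation<V> annotation, V value) {
        data.put(annotation.key, value);
    }

    @SuppressWarnings("unchecked")
    <V> V get(Annotation<V> annotation) {
        // Unchecked cast: nothing verifies that the stored value is a V.
        return (V) data.get(annotation.key);
    }

    /** Returns true if reading through a mismatched annotation throws a ClassCastException. */
    static boolean mismatchFails() {
        Annotation<String> asString = new Annotation<>("pos");
        Annotation<Integer> asInteger = new Annotation<>("pos"); // same key, other type
        UncheckedAnnotationSketch store = new UncheckedAnnotationSketch();
        store.set(asString, "NN");
        try {
            Integer value = store.get(asInteger); // compiler-inserted cast fails here
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }
}
```

The exception surfaces at the call site that reads the value, not where the conflicting annotation was defined, so keeping one Annotation constant per key is the practical safeguard.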

                
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
>                 Key: STANBOL-734
>                 URL: https://issues.apache.org/jira/browse/STANBOL-734
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - which is usually available at word 
> granularity - is not feasible using the RDF metadata, this describes the 
> addition of a special ContentPart to Stanbol. This ContentPart will have the 
> name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Span
> * Spans are sorted naturally by type, start and end. This allows the
> use of a NavigableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigableSet even make it possible to build Iterators that allow
> concurrent modifications (e.g. adding Chunks while iterating over the
> Tokens of a Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that supports a
> type-safe view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) the same instance (e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF, as representing this
> kind of data as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is some sample code:
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences();
>     while(sentences.hasNext()){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<Token> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             //the POS annotation with the highest probability
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.posAnnotation);
>             String tag = pos.value().getTag();
>             double confidence = pos.probability();
>         }
>     }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both also define an optional LexicalCategory. This is an enum
> with the 12 top level concepts defined by the
> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This also
> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLIA. The other TagSets used by OpenNLP are currently not available
> from OLIA.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagSet.getTag(tag);
>     if(pos == null){ //unknown tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.put(tag, pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POSAnnotation,
>         new Value<PosTag>(pos, prob));

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
