[
https://issues.apache.org/jira/browse/STANBOL-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500293#comment-13500293
]
Rupert Westenthaler commented on STANBOL-734:
---------------------------------------------
Documentation for Analyzed Text
AnalysedText
=====
The AnalysedText is a Java domain model designed to describe NLP processing
results. It consists of two major parts:
1. Structure of the Text such as text-sections, sentences, chunks and tokens
2. Annotations for the detected parts of the text.
## AnalysedText as ContentPart
Within the Stanbol Enhancer the AnalysedText is used as a
[ContentPart](../contentitem#content-parts) registered with the URI
<code>urn:stanbol.enhancer:nlp.analysedText</code>.
It can therefore be retrieved with the following code:
    :::java
    AnalysedText at;
    ci.getLock().readLock().lock();
    try {
        at = ci.getPart(AnalysedText.ANALYSED_TEXT_URI, AnalysedText.class);
    } catch (NoSuchPartException e) {
        //not present
        at = null;
    } finally {
        ci.getLock().readLock().unlock();
    }
Components that need to create an AnalysedText instance can do so by using the
_AnalysedTextFactory_:
    :::java
    @Reference
    AnalysedTextFactory atf;
    ContentItem ci; //the contentItem
    AnalysedText at;
    //Entry is java.util.Map.Entry, so the Blob is read with getValue()
    Entry<String,Blob> plainTextBlob = ContentItemHelper.getBlob(
        ci, Collections.singleton("text/plain"));
    if(plainTextBlob != null){
        //creates and adds the AnalysedText ContentPart to the ContentItem
        ci.getLock().writeLock().lock();
        try {
            at = atf.createAnalysedText(ci, plainTextBlob.getValue());
        } finally {
            ci.getLock().writeLock().unlock();
        }
    } else { //no NLP processing possible
        at = null;
    }
Outside of OSGi, users can call AnalysedTextFactory#getDefaultInstance() to
obtain the AnalysedTextFactory instance of the in-memory implementation.
## Structure of the Text
The basic building block of the AnalysedText is the Span. A Span defines a type,
the [start,end) offsets and the spanText. The type is a member of an
enumeration (_SpanTypeEnum_): Text, TextSection, Sentence, Chunk or Token.
[start,end) defines the character positions of the Span within the text, where
the start position is inclusive and the end position is exclusive.
Analogous to the type of the Span, there are also Java interfaces representing
those types and providing additional convenience methods. An additional
_Section_ interface was introduced as the common parent for all types that may
have enclosed Spans. AnalysedText is the interface representing
SpanTypeEnum#Text. The main intention of those Java interfaces is to provide
convenience methods that ease the use of the API.
### Uniqueness of Spans
A Span is considered equal to another Span if [start,end) and type are the
same. The natural order of Spans is defined by
* smaller start index first
* bigger end index first
* higher ordinal number of the SpanTypeEnum first
This order is used by all Iterators returned by the AnalysedText API.
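The three ordering rules can be illustrated with a small, self-contained sketch. It does not use the Stanbol classes: `SpanType`, `SimpleSpan` and `NATURAL_ORDER` are hypothetical stand-ins, the enum member order is an assumption, and the comparator simply applies the rules as stated above, one after the other.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class SpanOrderSketch {

    // Hypothetical stand-in for SpanTypeEnum; the member order is an assumption.
    enum SpanType { Text, TextSection, Sentence, Chunk, Token }

    // Hypothetical minimal span: a type plus absolute [start,end) offsets.
    static final class SimpleSpan {
        final SpanType type;
        final int start;
        final int end;

        SimpleSpan(SpanType type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    // The three rules from the text, applied in order.
    static final Comparator<SimpleSpan> NATURAL_ORDER = (a, b) -> {
        if (a.start != b.start) {
            return Integer.compare(a.start, b.start); // smaller start first
        }
        if (a.end != b.end) {
            return Integer.compare(b.end, a.end);     // bigger end first
        }
        // higher ordinal of the type first
        return Integer.compare(b.type.ordinal(), a.type.ordinal());
    };

    // Adds a few spans in arbitrary order and returns them in natural order.
    static List<String> sortedSample() {
        TreeSet<SimpleSpan> spans = new TreeSet<>(NATURAL_ORDER);
        spans.add(new SimpleSpan(SpanType.Token, 6, 10));
        spans.add(new SimpleSpan(SpanType.Chunk, 0, 5));
        spans.add(new SimpleSpan(SpanType.Sentence, 0, 10));
        spans.add(new SimpleSpan(SpanType.Token, 0, 5));
        spans.add(new SimpleSpan(SpanType.Text, 0, 20));
        // same [start,end) and type as an earlier span: treated as equal, not added
        spans.add(new SimpleSpan(SpanType.Token, 0, 5));
        List<String> result = new ArrayList<>();
        for (SimpleSpan span : spans) {
            result.add(span.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortedSample());
    }
}
```

Because the set uses the comparator for equality, adding a second span with the same [start,end) and type leaves the set unchanged, which mirrors the uniqueness rule.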
### Concurrent Modifications and Iterators
Iterators returned by the AnalysedText API MUST NOT throw
_ConcurrentModificationException_s but rather reflect changes to the
underlying model. While this is not consistent with the default behavior of
Iterators in Java, it is central to the effective usage of the AnalysedText
API, e.g. when iterating over Sentences while adding Tokens.
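One way to obtain such behavior on top of a sorted set is to look up the successor of the last returned element on demand, using `NavigableSet#higher`. The following is only a sketch of that idea on a plain `TreeSet<Integer>`; `LiveIteratorSketch` and `liveIterator` are hypothetical names and not part of the AnalysedText API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.NoSuchElementException;
import java.util.TreeSet;

class LiveIteratorSketch {

    // Iterator that resolves its next element lazily via NavigableSet#higher.
    // Elements added behind the current position are visited; additions never
    // cause a ConcurrentModificationException.
    static <E> Iterator<E> liveIterator(final NavigableSet<E> set) {
        return new Iterator<E>() {
            private E last;          // last element returned by next()
            private boolean started; // false until next() was called once

            private E peek() {
                if (!started) {
                    return set.isEmpty() ? null : set.first();
                }
                return set.higher(last);
            }

            @Override
            public boolean hasNext() {
                return peek() != null;
            }

            @Override
            public E next() {
                E next = peek();
                if (next == null) {
                    throw new NoSuchElementException();
                }
                last = next;
                started = true;
                return next;
            }
        };
    }

    // Iterates over {10, 20, 30} and modifies the set while visiting 20.
    static String demo() {
        NavigableSet<Integer> set = new TreeSet<>(Arrays.asList(10, 20, 30));
        StringBuilder seen = new StringBuilder();
        Iterator<Integer> it = liveIterator(set);
        while (it.hasNext()) {
            int value = it.next();
            seen.append(value).append(' ');
            if (value == 20) {
                set.add(25); // behind the current position: will be visited
                set.add(5);  // before the current position: not visited
            }
        }
        return seen.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trade-off is a lookup per step instead of a cached cursor, which is what allows a tokenizer to add Tokens while another loop is still walking the Sentences.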
### Code Samples:
The following code snippet shows some typical usages of the API:
    :::java
    AnalysedText at; //typically retrieved from the contentPart
    Iterator<Sentence> sentences = at.getSentences();
    while(sentences.hasNext()){
        Sentence sentence = sentences.next();
        String sentText = sentence.getSpan();
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            String tokenText = token.getSpan();
            Value<PosTag> pos = token.getAnnotation(
                NlpAnnotations.POS_ANNOTATION);
            if(pos != null){ //null if the Token has no POS annotation
                String tag = pos.value().getTag();
                double confidence = pos.probability();
            }
        }
    }
Code that adds new Spans looks as follows:
    :::java
    //Tokenize a text
    Iterator<Sentence> sentences = at.getSentences();
    Iterator<? extends Section> sections;
    if(sentences.hasNext()){ //sentence annotations present
        sections = sentences;
    } else { //if no sentences tokenize the text at once
        sections = Collections.singleton(at).iterator();
    }
    //Tokenize the sections
    while(sections.hasNext()){
        Section section = sections.next();
        //assuming the Tokenizer returns tokens as 2dim int array
        int[][] tokenSpans = tokenizer.tokenize(section.getSpan());
        for(int ti = 0; ti < tokenSpans.length; ti++){
            Token token = section.addToken(
                tokenSpans[ti][0],tokenSpans[ti][1]);
        }
    }
For all #add**(start,end) methods in the API the passed start and end indexes
are relative to the parent (the one the #add**(..) method is called on). The
[start,end) indexes returned by Spans are absolute values. If an #add**(..)
method is called for a Span '[start,end):type' that already exists, then the
existing instance is returned instead of a new one.
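The difference between relative and absolute offsets, and the "return the existing instance" rule, can be sketched without the Stanbol classes. `SketchSection` and its `addToken` method below are hypothetical; a token is reduced to an `int[]` holding its absolute offsets.

```java
import java.util.HashMap;
import java.util.Map;

class RelativeOffsetSketch {

    // Hypothetical section covering the absolute range [start,end) of a text.
    static final class SketchSection {
        final int start;
        final int end;
        // tokens already added, keyed by their absolute "start,end" offsets
        private final Map<String, int[]> tokens = new HashMap<>();

        SketchSection(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // relStart/relEnd are relative to this section;
        // the returned array holds the absolute {start, end} offsets.
        int[] addToken(int relStart, int relEnd) {
            if (relStart < 0 || relEnd < relStart || relEnd > end - start) {
                throw new IllegalArgumentException("span outside of section");
            }
            int absStart = start + relStart;
            int absEnd = start + relEnd;
            // return the existing instance if this span was added before
            return tokens.computeIfAbsent(absStart + "," + absEnd,
                key -> new int[] {absStart, absEnd});
        }
    }

    public static void main(String[] args) {
        String text = "Hello world. Bye now.";
        SketchSection sentence = new SketchSection(13, 21); // "Bye now."
        int[] token = sentence.addToken(0, 3);              // relative offsets
        System.out.println(text.substring(token[0], token[1]));
        System.out.println(token == sentence.addToken(0, 3));
    }
}
```

A tokenizer that works on `section.getSpan()` produces offsets relative to that section, so it can pass them to the add method unchanged.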
## Annotation Support
Annotation support is provided by the _Annotated_ interface together with the
_Annotation_ and _Value_ classes. _Annotated_ provides an API for adding
information to the annotated object. Those annotations are represented by
key-value mappings where Object is used as key and the _Value_ class for
values. The _Value_ class provides the generically typed value as well as a
double probability in the range [0..1], or -1 if not known. Finally, the
_Annotation_ class is used to ensure type safety.
The following example shows the intended usage of the API:
1. One needs to define the _Annotation_s one would like to use. Annotations are
typically defined as public static members of interfaces or classes. The
following example uses the definition of the Part of Speech annotation.
        :::java
        public interface NlpAnnotations {
            //a Part of Speech Annotation using a String key
            //and the PosTag class as value
            Annotation<String,PosTag> POS_ANNOTATION =
                new Annotation<String,PosTag>(
                    "stanbol.enhancer.nlp.pos", PosTag.class);
            ...
        }
2. Defined _Annotation_s are used to add information to an _Annotated_ instance
(like a Span). For adding annotations the use of _Annotation_s is required to
ensure type safety. The following code snippet shows how to add a PosTag with
the probability 0.95.
        :::java
        PosTag tag = new PosTag("N"); //a simple POS tag
        Token token; //the Token we want to add the tag to
        //the probability belongs to the Value, not to the addAnnotation call
        token.addAnnotation(POS_ANNOTATION, Value.value(tag, 0.95));
3. For consuming annotations there are two options: first, using the
_Annotation_ object, and second, directly using the key. While the second
option is less convenient (as it does not provide type safety), it allows
consuming annotations without the need to have the used _Annotation_ on the
classpath. The following example shows both options:
        :::java
        Iterator<Token> tokens = sentence.getTokens();
        while(tokens.hasNext()){
            Token token = tokens.next();
            //(1) use the POS_ANNOTATION to get the PosTag
            Value<PosTag> tag = token.getAnnotation(POS_ANNOTATION);
            if(tag != null){
                log.info("{} has PosTag {}",token,tag.value());
            } else {
                log.info("{} has no PosTag",token);
            }
            //(2) use the key to retrieve values
            String key = "urn:test-dummy";
            Value<?> value = token.getValue(key);
            //the programmer needs to know the type!
            if(value != null && value.probability() > 0.5){
                log.info("{}={}",key,value.value());
            }
        }
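How a typed key can provide a type-safe view over a map with Object keys is sketched below. This is not the Stanbol implementation: `SketchAnnotation`, `SketchAnnotated` and the key `"sketch.length"` are hypothetical, and the sketch stores single plain values rather than _Value_ objects.

```java
import java.util.HashMap;
import java.util.Map;

class TypedKeySketch {

    // Hypothetical typed key: the raw key plus the expected value class.
    static final class SketchAnnotation<K, V> {
        final K key;
        final Class<V> valueType;

        SketchAnnotation(K key, Class<V> valueType) {
            this.key = key;
            this.valueType = valueType;
        }
    }

    // Hypothetical annotated object that stores values under Object keys.
    static final class SketchAnnotated {
        private final Map<Object, Object> values = new HashMap<>();

        // Typed write: the compiler rejects values of the wrong type.
        <K, V> void setAnnotation(SketchAnnotation<K, V> annotation, V value) {
            values.put(annotation.key, value);
        }

        // Typed read: the stored object is cast to the declared value class.
        <K, V> V getAnnotation(SketchAnnotation<K, V> annotation) {
            return annotation.valueType.cast(values.get(annotation.key));
        }

        // Untyped read by key: the caller has to know the type.
        Object getValue(Object key) {
            return values.get(key);
        }
    }

    static final SketchAnnotation<String, Integer> LENGTH =
        new SketchAnnotation<>("sketch.length", Integer.class);

    public static void main(String[] args) {
        SketchAnnotated token = new SketchAnnotated();
        token.setAnnotation(LENGTH, 5);
        Integer typed = token.getAnnotation(LENGTH);     // no cast needed
        Object untyped = token.getValue("sketch.length"); // cast is up to the caller
        System.out.println(typed + " " + untyped);
    }
}
```

The untyped accessor only needs the key, which is why a consumer can read annotations without having the class that defines the typed key on its classpath.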
The _Annotated_ interface supports multi-valued annotations. For that it
defines methods for adding/setting and getting multiple values. Values are
sorted first by probability (unknown probability last) and second by insert
order (first in, first out). So calling the single-value getAnnotation()
method on a multi-valued field will return the first item (the one with the
highest probability, or the first added in case of multiple items with the
same/no probability).
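The ordering of multiple values can be sketched with a sorted insert. The classes below are hypothetical and not the Stanbol ones; the sketch relies on the convention stated earlier that an unknown probability is represented as -1, so it sorts after every known probability in [0..1].

```java
import java.util.ArrayList;
import java.util.List;

class ValueOrderSketch {

    static final double UNKNOWN = -1; // assumed marker for "no probability"

    // Hypothetical value holder: a value plus its probability.
    static final class ScoredValue<T> {
        final T value;
        final double probability;

        ScoredValue(T value, double probability) {
            this.value = value;
            this.probability = probability;
        }
    }

    // Inserts behind all entries with a higher or equal probability, so
    // entries with the same (or no) probability keep their insert order.
    static <T> void add(List<ScoredValue<T>> values, ScoredValue<T> added) {
        int index = 0;
        while (index < values.size()
                && values.get(index).probability >= added.probability) {
            index++;
        }
        values.add(index, added);
    }

    // Adds a few tags in arbitrary order and returns them as stored.
    static List<String> demo() {
        List<ScoredValue<String>> values = new ArrayList<>();
        add(values, new ScoredValue<>("N", 0.6));
        add(values, new ScoredValue<>("V", UNKNOWN));
        add(values, new ScoredValue<>("ADJ", 0.9));
        add(values, new ScoredValue<>("ADV", 0.6));
        add(values, new ScoredValue<>("X", UNKNOWN));
        List<String> result = new ArrayList<>();
        for (ScoredValue<String> value : values) {
            result.add(value.value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With this layout a single-value getter only has to return the element at index 0.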
> ContentPart for NLP data - AnalyzedText
> ---------------------------------------
>
> Key: STANBOL-734
> URL: https://issues.apache.org/jira/browse/STANBOL-734
> Project: Stanbol
> Issue Type: Sub-task
> Reporter: Rupert Westenthaler
> Assignee: Rupert Westenthaler
>
> Because the management of NLP metadata - that is usually available on a word
> granularity - is not feasible using the RDF metadata, this describes the
> addition of a special ContentPart to Stanbol. This ContentPart will have the
> name AnalysedText.
> AnalysedText
> =====
> * It wraps the text/plain ContentPart of a ContentItem
> * It allows the definition of Spans (type, start, end, spanText). Type
> is an Enum: Text, TextSection, Sentence, Chunk, Token
> * Spans are sorted naturally by type, start and end. This allows to
> use a NavigableSet (e.g. TreeSet) and the #subSet() functionality
> to work with contained Tokens. The #higher and #lower methods of
> NavigableSet even allow building Iterators that tolerate concurrent
> modifications (e.g. adding Chunks while iterating over the Tokens of a
> Sentence).
> * One can attach Annotations to Spans. Basically a multi-valued Map
> with Object keys and Value<valueType> value(s) that supports a
> type-safe view by using generically typed Annotation<key,valueType>
> * The Value<valueType> object natively supports confidence. This
> allows (e.g. for POS tags) to use the same instance ( e.g. of the POS
> tag for Noun) to be used for all noun annotations.
> * Note that the AnalysedText does NOT use RDF for representing this
> kind of data as RDF is not scalable enough. This also means that the
> data of the AnalysedText are NOT available in the Enhancement Metadata
> of the ContentItem. However EnhancementEngines are free to write
> all/some results to the AnalysedText AND the RDF metadata of the
> ContentItem.
> Here is some sample code:
>     AnalysedText at; //the contentPart
>     Iterator<Sentence> sentences = at.getSentences();
>     while(sentences.hasNext()){
>         Sentence sentence = sentences.next();
>         String sentText = sentence.getSpan();
>         Iterator<Token> tokens = sentence.getTokens();
>         while(tokens.hasNext()){
>             Token token = tokens.next();
>             String tokenText = token.getSpan();
>             Value<PosTag> pos = token.getAnnotation(
>                 NlpAnnotations.POS_ANNOTATION);
>             String tag = pos.value().getTag();
>             double confidence = pos.probability();
>         }
>     }
> NLP annotations
> =====
> * TagSet and Tag<tagType>: A TagSet can be used for 1..n languages and
> contains Tags of a specific generic type. The Tag only defines a
> String "tag" property
> * Currently Tags for POS (PosTag) and Chunking (PhraseTag) are
> defined. Both also define an optional LexicalCategory. This is an enum
> with the 12 top level concepts defined by the
> [Olia](http://nlp2rdf.lod2.eu/olia/) ontology (e.g. Noun, Verb,
> Adjective, Adposition, Adverb ...)
> * TagSets (including mapped LexicalCategories) are defined for all
> languages where POS taggers are available for OpenNLP. This also
> includes the "penn.owl", "stts.owl" and "parole_es_cat.owl" provided by
> OLIA. The other TagSets used by OpenNLP are currently not available
> from OLIA.
> * Note that the LexicalCategory can be used to process POS annotations
> of different languages
> TagSet:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/TagSet.java
> POS:
> https://bitbucket.org/srfgkmt/stanbol-nlp/src/b064095a1b56/stanbol-enhancer-nlp/src/main/java/org/apache/stanbol/enhancer/nlp/pos
> A code sample:
>     TagSet<PosTag> tagSet; //the used TagSet
>     Map<String,PosTag> unknown; //missing tags in the TagSet
>     Token token; //the token
>     String tag; //the detected tag
>     double prob; //the probability
>     PosTag pos = tagSet.getTag(tag);
>     if(pos == null){ //unknown tag
>         pos = unknown.get(tag);
>     }
>     if(pos == null) {
>         pos = new PosTag(tag);
>         //this tag will not have a LexicalCategory
>         unknown.put(tag, pos); //only one instance
>     }
>     token.addAnnotation(
>         NlpAnnotations.POS_ANNOTATION,
>         new Value<PosTag>(pos, prob));