[jira] [Commented] (STANBOL-736) OpenNLP Chunker Engine

Rupert Westenthaler (JIRA) Wed, 21 Nov 2012 06:14:02 -0800

    [ 
https://issues.apache.org/jira/browse/STANBOL-736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501995#comment-13501995
 ]


Rupert Westenthaler commented on STANBOL-736:
---------------------------------------------

Documentation for this Engine

OpenNLP Chunker Engine
=======

The OpenNLP Chunker Engine supports the detection of Phrases (Noun, Verb, ...) 
within the passed text. For that it uses the OpenNLP Chunker feature. Detected 
Phrases are added as _Chunk_s to the _[AnalyzedText](../nlp/analyzedtext)_ 
content part. In addition, added _Chunk_s are annotated with a [Phrase 
Annotation](../nlp/nlpannotations#phrase-annotations) providing the type of the 
Phrase represented by the _Chunk_.


## Consumed information

* __Language__ (required): The language of the text needs to be available. It 
is read as specified by 
[STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the 
metadata of the ContentItem. Effectively this means that a Stanbol Language 
Detection engine needs to be executed before the OpenNLP Chunker Engine.
* __Tokens with POS annotations__ (required): This Engine needs the text to be 
tokenized and POS tagged. Moreover, the POS tags need to be compatible with the 
POS tags used to train the chunker model. This effectively means that this 
Engine will only work as expected if the POS tagging was done by the OpenNLP 
POS Tagging Engine configured with a POS model using the same POS tag set as 
the one used for training the chunker model. 
* __Sentences__ (optional): In case _Sentence_s are available in the 
_AnalyzedText_ content part, the chunking of the text is done sentence by 
sentence. Otherwise the whole text is chunked at once.
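
The optional use of _Sentence_s described above can be pictured as grouping the POS tagged tokens into batches before they are handed to the chunker. The following is only a minimal sketch in Python, not the engine's implementation: the function name `token_batches` and the tuple layout `(start, end, pos_tag)` are assumptions made for illustration.

```python
def token_batches(tokens, sentences=None):
    """Group tokens into the batches that would be chunked together.

    tokens:    list of (start, end, pos_tag) tuples (hypothetical layout)
    sentences: optional list of (start, end) character spans
    """
    if not sentences:
        # No sentence annotations available: process the whole text at once.
        return [list(tokens)]
    batches = []
    for s_start, s_end in sentences:
        # Collect the tokens that lie completely inside this sentence span.
        batch = [t for t in tokens if t[0] >= s_start and t[1] <= s_end]
        if batch:
            batches.append(batch)
    return batches


tokens = [(0, 3, "DT"), (4, 7, "NN"), (8, 9, "."), (10, 12, "PRP"), (13, 17, "VBZ")]
per_sentence = token_batches(tokens, [(0, 9), (10, 17)])
whole_text = token_batches(tokens)
```

With the two sentence spans the tokens end up in two batches, without them in a single batch covering the whole text.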

## Configuration

The OpenNLP Chunker Engine provides a default service instance (configuration 
policy is optional) that is configured to process all languages. For German the 
model parameter is set to 'OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip', a 
chunker model that only detects Noun Phrases. This model is included in the 
'o.a.stanbol.data.opennlp.lang.de' module. This Engine instance uses the name 
'opennlp-chunker' and has a service ranking of '-100'.

This engine supports the default configuration for Enhancement Engines 
including the __name__ _(stanbol.enhancer.engine.name)_ and the __ranking__ 
_(service.ranking)_. In addition it is possible to configure the __processed 
languages__ _(org.apache.stanbol.enhancer.chunker.languages)_ and a parameter 
to specify the name of the chunker model used for a language.

__1. Processed Language Configuration:__

For the configuration of the processed languages the following syntax is used:

    de
    en
    
This would configure the Engine to only process German and English texts. It is 
also possible to explicitly exclude languages:

    !fr
    !it
    *

This specifies that all languages other than French and Italian are processed.

Values can be passed as Array or Vector. This is done by using the 
["elem1","elem2",...] syntax as defined by OSGi ".config" files. As a fallback, 
',' separated Strings are also supported. 

The following example shows the two above examples combined to a single 
configuration.

    org.apache.stanbol.enhancer.chunker.languages=["!fr","!it","de","en","*"]
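
To make the include, exclude and wildcard rules more concrete, here is a small sketch in Python of how such a list could be evaluated. It is an illustration of the semantics described above, not Stanbol's own code: the function names are hypothetical, and it assumes that an explicit exclusion always wins over the `*` wildcard.

```python
def parse_language_config(entries):
    """Split the entries into includes, excludes and a wildcard flag."""
    if isinstance(entries, str):
        # Fallback: a single ',' separated String instead of an Array.
        entries = entries.split(",")
    included, excluded, wildcard = set(), set(), False
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        if entry == "*":
            wildcard = True
        elif entry.startswith("!"):
            excluded.add(entry[1:].strip().lower())
        else:
            included.add(entry.lower())
    return included, excluded, wildcard


def is_processed(language, entries):
    """True if the configuration enables processing of the language."""
    included, excluded, wildcard = parse_language_config(entries)
    language = language.lower()
    if language in excluded:
        # Assumption: explicit exclusions take precedence over '*'.
        return False
    return wildcard or language in included


combined = ["!fr", "!it", "de", "en", "*"]
only_de_en = ["de", "en"]
```

Under these assumptions `is_processed("es", combined)` is true because of the wildcard, while `is_processed("es", only_de_en)` and `is_processed("fr", combined)` are both false.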

NOTE that the "processed language" configuration only specifies what languages 
are considered for processing. If "de" is enabled, but there is no chunker 
model available for that language, then German text will still not be 
processed. However, if there is a chunker model for "it" but the "processed 
language" configuration does not include Italian, then Italian text will NOT be 
processed. 

__2. Chunker model parameter__

The OpenNLP Chunker Engine supports the 'model' parameter to explicitly pass 
the name of the chunker model used for a language. 
Models are loaded via the Stanbol DataFile provider infrastructure. That means 
that models can be loaded from the {stanbol-working-dir}/stanbol/datafiles 
folder.

The syntax for parameters is as follows:

    {language};{param-name}={param-value}

As shown by the default configuration of this engine, to use 
"OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip" for chunking German texts one 
can use a configuration as follows:

    de;model=OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip
    *
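
The `{language};{param-name}={param-value}` syntax can be split with a few lines of code. The following Python sketch only illustrates the format shown above; the function name `parse_entry` is hypothetical, and the handling of several `;` separated parameters in one entry is an assumption.

```python
def parse_entry(entry):
    """Return (language, params) for an entry such as 'de;model=foo.zip'."""
    parts = [part.strip() for part in entry.split(";")]
    language = parts[0]
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        name, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected {param-name}={param-value}: %r" % part)
        params[name.strip()] = value.strip()
    return language, params


german = parse_entry("de;model=OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip")
wildcard = parse_entry("*")
```

Here `german` holds the language `de` together with its `model` parameter, and `wildcard` holds `*` with no parameters.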

By default OpenNLP chunker models are loaded from '{lang}-chunker.bin'. To use 
models with other names users need to use the 'model' parameter as described 
above.
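
The fallback from an explicit 'model' parameter to the default file name can be sketched as follows. This is a Python illustration of the naming rule only; the function name is hypothetical, and the actual lookup of the file via the DataFile provider is not shown.

```python
def chunker_model_name(language, params=None):
    """Pick the model file name for a language.

    An explicit 'model' parameter wins; otherwise the default
    '{lang}-chunker.bin' naming pattern is used.
    """
    params = params or {}
    return params.get("model") or "%s-chunker.bin" % language


default_name = chunker_model_name("en")
german_name = chunker_model_name(
    "de", {"model": "OpenNLP_1.5.1-German-Chunker-TigerCorps07.zip"}
)
```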

                
> OpenNLP Chunker Engine
> ----------------------
>
>                 Key: STANBOL-736
>                 URL: https://issues.apache.org/jira/browse/STANBOL-736
>             Project: Stanbol
>          Issue Type: Sub-task
>            Reporter: Rupert Westenthaler
>
> This EnhancementEngine requires Sentences and Tokens with POS annotations to 
> be present in the AnalyzedText content part. It uses this information to 
> create chunks and stores them in the AnalyzedText content part.

