[jira] [Comment Edited] (STANBOL-795) OpenNLP Tokenizer Engine

Rupert Westenthaler (JIRA) Wed, 21 Nov 2012 05:26:04 -0800

    [ https://issues.apache.org/jira/browse/STANBOL-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501935#comment-13501935 ]


Rupert Westenthaler edited comment on STANBOL-795 at 11/21/12 1:24 PM:
-----------------------------------------------------------------------

Documentation for this engine

OpenNLP Tokenizer Engine
===========

The OpenNLP Tokenizer Engine adds _Token_s to the _AnalyzedText_ content part. 
If this content part is not yet present it adds it to the ContentItem.

## Consumed information

* __Language__ (required): The language of the text needs to be available. It 
is read as specified by 
[STANBOL-613](https://issues.apache.org/jira/browse/STANBOL-613) from the 
metadata of the ContentItem. Effectively this means that any Stanbol Language 
Detection engine will need to be executed before the OpenNLP Tokenizer Engine.
* __Sentences__ (optional): In case _Sentence_s are available in the 
_AnalyzedText_ content part the tokenization of the text is done sentence by 
sentence. Otherwise the whole text is tokenized at once.
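
The two modes described for __Sentences__ can be illustrated with a small sketch. This is not the engine's code: the regular-expression tokenizer merely stands in for an OpenNLP tokenizer, and the names `tokenize_span` and `tokenize` are hypothetical. It only shows that token offsets stay relative to the whole text in both modes.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets into the full text


def tokenize_span(text: str, start: int, end: int) -> List[Span]:
    """Stand-in tokenizer (assumption): word runs and single punctuation marks."""
    return [(start + m.start(), start + m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text[start:end])]


def tokenize(text: str, sentences: Optional[List[Span]] = None) -> List[Span]:
    """Tokenize sentence by sentence if sentence spans exist, else all at once."""
    if sentences:
        tokens: List[Span] = []
        for start, end in sentences:
            tokens.extend(tokenize_span(text, start, end))
        return tokens
    return tokenize_span(text, 0, len(text))


text = "Stanbol rocks. It uses OpenNLP."
sentences = [(0, 14), (15, 31)]  # spans a sentence detector might have added
tokens = tokenize(text, sentences)
```

With this rule-based stand-in both modes yield the same tokens. A statistical tokenizer model may behave differently at sentence boundaries, which is presumably why existing sentence annotations are used when present.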

## Configuration

The OpenNLP Tokenizer engine provides a default service instance (configuration 
policy is optional). This instance processes all languages. Language specific 
tokenizer models are used if available. For other languages the OpenNLP 
SIMPLE_TOKENIZER is used. This Engine instance uses the name 'opennlp-token' 
and has a service ranking of '-100'.

In addition to the default configuration properties, the __name__ 
_(stanbol.enhancer.engine.name)_ and the __ranking__ _(service.ranking)_, the 
engine allows configuring the __processed languages__ 
_(org.apache.stanbol.enhancer.token.languages)_ and a parameter that specifies 
the name of the tokenizer model used for a language.

__1. Processed Language Configuration:__

For the configuration of the processed languages the following syntax is used:

    de
    en
    
This would configure the Engine to only process German and English texts. It is 
also possible to explicitly exclude languages:

    !fr
    !it
    *

This specifies that all languages other than French and Italian are tokenized.
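
The include/exclude rules can be sketched as a small predicate. This is an illustration under assumptions, not the engine's implementation: `is_processed` is a hypothetical name, and the sketch assumes that an explicit `!lang` exclusion always wins over the `*` wildcard. It does not model the engine's behaviour for an empty configuration.

```python
from typing import Iterable


def is_processed(config: Iterable[str], language: str) -> bool:
    """Decide whether `language` is processed under the given config entries."""
    entries = [e.strip() for e in config if e.strip()]
    excluded = {e[1:] for e in entries if e.startswith("!")}
    included = {e for e in entries if not e.startswith("!") and e != "*"}
    if language in excluded:
        return False  # assumption: exclusions take priority over '*'
    # Explicitly listed languages are processed; '*' admits all the others.
    return language in included or "*" in entries


only_de_en = ["de", "en"]
all_but_fr_it = ["!fr", "!it", "*"]
```

For example, `is_processed(only_de_en, "fr")` is false because French is not listed, while `is_processed(all_but_fr_it, "es")` is true because the wildcard admits every language that is not excluded.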

Values can be passed as Array or Vector. This is done by using the 
["elem1","elem2",...] syntax as defined by OSGi ".config" files. As a fallback, 
','-separated Strings are also supported. 

The following example shows the two examples above combined into a single 
configuration:

    org.apache.stanbol.enhancer.token.languages=["!fr","!it","de","en","*"]
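
As a rough illustration of the two accepted value forms, the following sketch turns either the array syntax or the ','-separated fallback into a list of entries. The function name `parse_languages_value` is hypothetical. Using `json.loads` is a shortcut that only works because the simple `["a","b"]` form happens to be valid JSON; it is not a parser for the full ".config" file format.

```python
import json
from typing import List


def parse_languages_value(value: str) -> List[str]:
    """Split a property value into its language configuration entries."""
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        # Shortcut (assumption): treat the simple array form as JSON.
        return [str(v) for v in json.loads(value)]
    # Fallback: ','-separated string, ignoring surrounding whitespace.
    return [part.strip() for part in value.split(",") if part.strip()]


array_form = parse_languages_value('["!fr","!it","de","en","*"]')
fallback_form = parse_languages_value("!fr, !it, de, en, *")
```

Both calls produce the same list of five entries, so the rest of a configuration reader would not need to know which form was used.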

__2. Tokenizer model parameter__

The OpenNLP Tokenizer engine supports the 'model' parameter to explicitly 
specify the name of the Tokenizer model used for a language. Tokenizer models 
are loaded via the Stanbol DataFile provider infrastructure. That means that 
models can be loaded from the {stanbol-working-dir}/stanbol/datafiles folder.

The syntax for parameters is as follows:

    {language};{param-name}={param-value}

So to use the "my-de-tokenizer-model.zip" for tokenizing German texts one can 
use a configuration as follows:

    de;model=my-de-tokenizer-model.zip
    *

To use the SIMPLE_TOKENIZER for a given language, set the 'model' parameter to 
'SIMPLE', as shown in the following example:

    de;model=SIMPLE
    *

By default the OpenNLP Tokenizer model for a language is loaded under the name 
'{lang}-token.bin'. To use models with other names, users need to use the 
'model' parameter as described above.
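
The `{language};{param-name}={param-value}` syntax and the model lookup can be sketched as follows. This is a minimal sketch, not the engine's code: `parse_entry` and `resolve_model` are hypothetical names, and the default name `{lang}-token.bin` is taken from the issue description quoted further down. The sketch ignores exclusions and does not model the fallback to the simple tokenizer when the default model file is missing.

```python
from typing import Dict, Iterable, Tuple


def parse_entry(entry: str) -> Tuple[str, Dict[str, str]]:
    """Split '{language};{param-name}={param-value}' into language and params."""
    language, *raw_params = [part.strip() for part in entry.split(";")]
    params: Dict[str, str] = {}
    for raw in raw_params:
        name, _, value = raw.partition("=")
        params[name.strip()] = value.strip()
    return language, params


def resolve_model(entries: Iterable[str], language: str) -> str:
    """Return the configured model name, 'SIMPLE', or the default file name."""
    for entry in entries:
        entry_language, params = parse_entry(entry)
        if entry_language == language and "model" in params:
            return params["model"]
    return f"{language}-token.bin"  # default name per the issue description


config = ["en;model=SIMPLE", "de;model=mySpecificTokenizerModel_de.bin", "*"]
```

Here `resolve_model(config, "de")` returns the explicitly configured file name, `resolve_model(config, "en")` returns `SIMPLE`, and any other language falls back to its default model name.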
                
> OpenNLP Tokenizer Engine
> ------------------------
>
>                 Key: STANBOL-795
>                 URL: https://issues.apache.org/jira/browse/STANBOL-795
>             Project: Stanbol
>          Issue Type: Sub-task
>          Components: Enhancer
>            Reporter: Rupert Westenthaler
>            Assignee: Rupert Westenthaler
>
> Implement a separate OpenNLP Tokenizer Engine.
> While some Engines like the OpenNLP POS or the CELI Lemmatizer engine do 
> support tokenizing (if tokens do not already exist in the Analyzed Text) it 
> is important to implement an engine explicitly for this task.
> This engine also supports the language configuration (see following example)
>     en;model=SIMPLE
>     de;model=mySpecificTokenizerModel_de.bin
>     !jp
>     !zh
>     *
> The 'model' parameter can be used to load specific tokenizer models. "SIMPLE" 
> forces the use of the OpenNLP SimpleTokenizer. If no model configuration is 
> present the default tokenizer for the language is loaded ("{lang}-token.bin" 
> or the simple tokenizer if the language model is not present).
