Hi Ghufran,

On Mon, Sep 22, 2014 at 10:42 AM, Mohammad Ghufran <emghuf...@gmail.com> wrote:
> Hello,
>
> I am interested in using Stanbol as part of my Research project but I am
> having trouble handling languages other than English. I realize that this
> list is for development and my questions may not be 100% relevant to
> development, but this is the best place I could find to ask for help. I'd
> appreciate it if someone could guide me a little, given that the
> documentation is quite sparse!
>
> I am primarily interested in doing named entity recognition in multiple
> languages (mostly French and English). For this, I found a model for
> French built by someone here:
> http://enicolashernandez.blogspot.fr/2012/12/apache-opennlp-fr-models.html
> Models for all the tasks, including segmentation, tokenization, POS, and
> NER for French, can be found there. What I am unable to achieve is to
> successfully use these models. From what I gather, all the external models
> should be put inside the {install-directory}/stanbol/datafiles directory.

That's correct. If you copy the models into this directory, Stanbol
can find them.

However, the OpenNLP modules expect specific naming patterns for the
model files, so make sure your custom models follow these naming
schemes (a quick way to verify the renamed files is sketched right
after the list):

* Sentence: {lang}-sent.bin (e.g. "fr-sent.bin")
* Token: {lang}-token.bin (e.g. "fr-token.bin")
* Pos: {lang}-pos-perceptron.bin or {lang}-pos-maxent.bin, depending on
whether you use a perceptron or a maxent model (e.g. "fr-pos-maxent.bin")
* Chunker: {lang}-chunker.bin (e.g. "fr-chunker.bin")
* Namefinder: {lang}-ner-{type}.bin. The default types are:
    * person (e.g. "fr-ner-person.bin")
    * location (e.g. "fr-ner-location.bin")
    * organization (e.g. "fr-ner-organization.bin")
    * for other types see
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlpcustomner
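
If you want to double-check that the downloaded French models are valid
before wiring them into Stanbol, you can try to load them with the plain
OpenNLP API first. Below is a rough sketch; the file names are just the
examples from the list above and the path assumes the datafiles folder
of your launcher, so adapt both as needed:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerModel;

// Quick sanity check: tries to load the renamed model files with the
// plain OpenNLP API. If a file fails here, Stanbol will not be able
// to use it either, independent of the engine configuration.
public class CheckFrenchModels {

    public static void main(String[] args) {
        // adjust this to the datafiles folder of your Stanbol launcher
        File dir = new File("stanbol/datafiles");
        load(new File(dir, "fr-sent.bin"), "sentence");
        load(new File(dir, "fr-token.bin"), "token");
        load(new File(dir, "fr-pos-maxent.bin"), "pos");
        load(new File(dir, "fr-ner-person.bin"), "ner");
    }

    private static void load(File file, String type) {
        if (!file.isFile()) {
            System.out.println("MISSING: " + file);
            return;
        }
        try (InputStream in = new FileInputStream(file)) {
            if ("sentence".equals(type)) {
                new SentenceModel(in);
            } else if ("token".equals(type)) {
                new TokenizerModel(in);
            } else if ("pos".equals(type)) {
                new POSModel(in);
            } else {
                new TokenNameFinderModel(in);
            }
            System.out.println("OK: " + file.getName());
        } catch (Exception e) {
            System.out.println("FAILED to load " + file.getName() + ": " + e);
        }
    }
}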

You can use models with other names, but in that case you will need to
explicitly configure those names on the engines that use them. If you
want to go this route, please see the documentation of the respective
engines:

* Sentence Detection:
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlpsentence
* Tokenization:
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlptokenizer
* Pos Tagging: 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlppos
* Chunking: 
http://stanbol.apache.org/docs/trunk/components/enhancer/engines/opennlpchunker

All of those engines allow you to configure the processed languages.
Via the `model` parameter of a language you can set the name of the
model file (located in the `stanbol/datafiles/` folder), as in the
example below.
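
If I remember the syntax of the language configuration correctly, an
entry for French with a custom POS model file would look something like
this (the file name is just an example; please double-check the exact
syntax against the engine documentation linked above):

    fr;model=fr-pos-maxent.bin

The same pattern should work for the other engines, as long as the
model file is present in the datafiles folder.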

Hope this solves your issue.
best
Rupert

> However, when I create a chain with the new components, I get an error
> that one of the models was not found (this seems to be arbitrary, since
> all the models are in the same location but the error doesn't occur for
> all of them; for example, sentence segmentation with the French model
> seems to work fine but tokenization fails). Could someone please help me
> with how to set up models for other languages? Inside the opennlp
> directory, there are folders for 'lang' and 'ner'; what are these for
> precisely?
>
> Secondly, I also wanted to investigate using the OpenCalais enhancement
> engine. There is limited documentation about it, which says that an API
> key must be obtained. However, I don't see any enhancement engine for
> OpenCalais in the OSGi console. Could someone please suggest how I could
> proceed with configuring this engine?
>
> I have compiled Apache Stanbol from source.
>
> Best Regards and thanks in advance!
> Ghufran



-- 
| Rupert Westenthaler             rupert.westentha...@gmail.com
| Bodenlehenstraße 11                              ++43-699-11108907
| A-5500 Bischofshofen
| REDLINK.CO 
..........................................................................
| http://redlink.co/
