On 09/17/2014 12:16 PM, Alejandro Molina wrote:
I am a researcher in NLP and I am trying to use the OpenNLP resources in
Spanish. I have noticed some issues.
- First, why are the Spanish resources not maintained across versions? For
instance, versions 1.3 and 1.4 provide trained postag, sentdetect and
tokenize models:
http://maven.tamingtext.com/opennlp-models/models-1.3/spanish/
http://maven.tamingtext.com/opennlp-models/models-1.4/spanish/
while in 1.5 only es-ner models are available for Spanish:
http://maven.tamingtext.com/opennlp-models/models-1.5/
I tried to use the 1.4 SentenceDetector model with OpenNLP version 1.5.3
and it simply does not work. Apparently, the binary formats are
different*.
The problem is that OpenNLP must be trained on a corpus to produce the
statistical models you refer to above. That usually involves writing a
parser for the corpus format. The format conversion script that was used
to train the Spanish models was never released as part of OpenNLP and is
therefore not available to us.
In 1.5 a couple of changes were made that require models to be retrained,
and sadly we couldn't do that for the Spanish models.
Another issue is that the Apache OpenNLP project can only release
artifacts which fulfill certain license requirements (e.g. licensed under
the Apache License 2.0 or a compatible license); otherwise the project
can't distribute those artifacts.
The licenses for the corpora are often very restrictive and not
compatible with the AL 2.0. It might be the case that we could release
trained models anyway, because they don't contain the corpus, but in
order to be sure that this is allowed we would need to work through these
legal issues on a per-corpus basis.
To circumvent any legal issues we decided to release the corpus format
parsing code as part of OpenNLP and let our users train their own models.
I think there is support for some Spanish corpora in the latest head
version.
You are welcome to contribute format parsing code to OpenNLP to support
the corpus of your choice.
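As a rough sketch of what training your own model looks like with the
1.5.x command line tools (the file names es-sent.train and es-sent.bin
here are made up; the training file is assumed to be UTF-8 text with one
sentence per line):

```shell
# Train a Spanish sentence detector from a UTF-8 training file,
# then the resulting es-sent.bin can be passed to SentenceDetector.
opennlp SentenceDetectorTrainer -encoding UTF-8 -lang es \
    -data es-sent.train -model es-sent.bin
```

Run "opennlp SentenceDetectorTrainer" without arguments to see the exact
parameters your version expects, as they have changed between releases.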
- Second, what is the correct way of invoking SentenceDetector with UTF-8
support?
I created a sentdetect model for Spanish based on the CoNLL-02 data, as
suggested in a post**. I used the "-encoding UTF-8" argument (and the
input text is UTF-8 as well)***. However, the output file is ASCII text
and shows "??" instead of vowels with diacritics (like á, é, í, ó, ú) ****.
OpenNLP will decode the input file with the specified encoding. However,
the text then has to be encoded again to be written to the console, and
as far as I recall the platform default encoding is used for that. The
details of how that works might also differ from platform to platform.
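A minimal, self-contained sketch of what happens at that second encoding
step (plain JDK, no OpenNLP involved): if the output charset cannot
represent a character, Java substitutes '?', which is exactly the "??"
effect you are seeing.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode a string the way a console/file writer with the given
    // charset would, then decode it back to see what survived.
    static String roundTrip(String text, java.nio.charset.Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) throws Exception {
        String text = "canción"; // Spanish text with a diacritic

        // UTF-8 can represent 'ó', so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));   // canción

        // US-ASCII cannot represent 'ó'; getBytes() replaces it
        // with '?', which is the corruption seen in the output.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // canci?n

        // One way to force UTF-8 output regardless of the platform
        // default encoding:
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(text);
    }
}
```

Another common workaround is to start the JVM with
-Dfile.encoding=UTF-8 so the platform default itself is UTF-8.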
Which OS are you running? I usually run the command line tools on Linux
with UTF-8 as the default encoding, and that combination never seems to
output wrong characters.
I once saw a similar problem on Windows with Japanese text; the output
consisted mostly of question mark characters.
HTH,
Jörn