On 09/17/2014 12:16 PM, Alejandro Molina wrote:
I am a researcher in NLP and I am trying to use the OpenNLP resources in
Spanish. I noticed some issues.

- First, why are the Spanish resources not maintained across versions? For
instance, versions 1.3 and 1.4 include trained postag, sentdetect and
tokenize binaries:

http://maven.tamingtext.com/opennlp-models/models-1.3/spanish/
http://maven.tamingtext.com/opennlp-models/models-1.4/spanish/

while for 1.5 only Spanish NER (es-ner) support is available:

http://maven.tamingtext.com/opennlp-models/models-1.5/

I tried to use the 1.4 version of the SentenceDetector model with OpenNLP
version 1.5.3 and it simply does not work. Apparently the binary formats
are different*.

The problem is that OpenNLP must be trained on a corpus to produce the statistical models you refer to above,
which usually includes writing a parser for the corpus format. The format conversion script which was used to train the Spanish models was never
released as part of OpenNLP and is therefore not available to us.
In 1.5 a couple of changes were included which require models to be retrained, and sadly
we couldn't do that for the Spanish models.

Another issue is that the Apache OpenNLP project can only release artifacts which fulfill certain license requirements (e.g. licensed under the Apache License 2.0 or a compatible license); otherwise
the project can't distribute those artifacts.

The licenses for the corpora are often very restrictive and not compatible with the AL 2.0. It might still be possible to release trained models, because they don't contain the corpus itself, but in order to be sure that is allowed we would need to work through these legal issues on a per-corpus basis.

To circumvent any legal issues we decided to release the corpus format parsing code as part of OpenNLP and let our users train their own models. I think there is support for some Spanish corpora in the latest head version. You are welcome to contribute format parsing code to OpenNLP to support the corpus of your choice.
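For example, once a corpus has been converted to OpenNLP's default training format for sentence detection (one sentence per line), a model can be trained from the command line roughly like this. The file names are placeholders, and the exact tool and format arguments should be checked against the CLI documentation of your 1.5.x release:

```shell
# Train a Spanish sentence detector from UTF-8 training data.
# es-sent.train is a placeholder file: one sentence per line,
# with an empty line between documents.
opennlp SentenceDetectorTrainer -lang es -encoding UTF-8 \
    -data es-sent.train -model es-sent.bin
```

The resulting es-sent.bin can then be loaded by the SentenceDetector tool or API of the same OpenNLP version it was trained with.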

- Second, what is the correct way of invoking the SentenceDetector with UTF-8
support?

I created a model for sentdetect in Spanish based on the CoNLL-02 corpus as
suggested in a post**. I used the "-encoding UTF-8" argument (and the input
text is in UTF-8 as well)***. However, the output file is ASCII text and
shows "??" instead of vowels with diacritics (like á, é, í, ó, ú) ****.

OpenNLP will decode the input file with the specified encoding. Anyway, the text then has to be encoded again to be written to the console, and as far as I recall the platform default encoding is used for that. The details of how that works might also differ from platform
to platform.
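The "??" output is consistent with the output stream being encoded with a charset that cannot represent the diacritics. A small pure-JDK sketch (no OpenNLP involved; the text and charset names are just examples) of what happens when text is written with an ASCII default encoding versus UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class EncodingDemo {

    // Round-trip text through a PrintStream using the given charset,
    // mimicking how console output is encoded with the platform default.
    static String roundTrip(String text, String charset)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, charset);
        out.print(text);
        out.flush();
        return new String(buffer.toByteArray(), charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String spanish = "está aquí";
        // An ASCII encoding replaces unmappable characters with '?':
        System.out.println(roundTrip(spanish, "US-ASCII")); // est? aqu?
        // A UTF-8 encoding preserves the diacritics:
        System.out.println(roundTrip(spanish, "UTF-8"));    // está aquí
    }
}
```

So if the JVM's default encoding on your platform is ASCII or a legacy code page, every diacritic vowel degrades to "?" exactly as you describe; redirecting the output to a file and inspecting the bytes can confirm which encoding the JVM actually used.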

Which OS are you running? I usually run the command line tools on Linux with
UTF-8 as the default encoding, and that combination never seems to output wrong characters.

I once saw a similar problem on Windows with Japanese text; the output consisted
mostly of question mark characters.

HTH,
Jörn
