On 09/17/2014 12:16 PM, Alejandro Molina wrote:
I am a researcher in NLP and I am trying to use the OpenNLP resources in
Spanish. I have noticed some issues.
- First, why are the Spanish resources not maintained across versions? For
instance, versions 1.3 and 1.4 provide trained postag, sentdetect and
tokenize models:
http://maven.tamingtext.com/opennlp-models/models-1.3/spanish/
http://maven.tamingtext.com/opennlp-models/models-1.4/spanish/
while in 1.5 only es-ner models are available for Spanish:
http://maven.tamingtext.com/opennlp-models/models-1.5/
I tried to use the 1.4 SentenceDetector model with OpenNLP version 1.5.3
and it simply does not work. Apparently, the binary formats are
different*.
The problem is that OpenNLP must be trained on a corpus to produce the
statistical models you refer to above. That usually involves writing a
parser for the corpus format. The format conversion script that was used
to train the Spanish models was never released as part of OpenNLP and is
therefore not available to us.
In 1.5 a couple of changes were made that require models to be retrained,
and sadly we couldn't do that for the Spanish models.
Another issue is that the Apache OpenNLP project can only release
artifacts which fulfill certain license requirements (e.g. licensed under
the Apache License 2.0 or a compatible license); otherwise the project
can't distribute those artifacts.
The licenses for the corpora are often very restrictive and not
compatible with the AL 2.0. It might be the case that we could release
trained models anyway, because they don't contain the corpus, but in
order to be sure that this is allowed we would need to work through these
legal issues on a per-corpus basis.
To circumvent any legal issues we decided to release the corpus format
parsing code as part of OpenNLP and let our users train their own models.
I think there is support for some Spanish corpora in the latest head
version.
You are welcome to contribute format parsing code to OpenNLP to support
the corpus of your choice.
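As a rough sketch of what training your own model looks like with the
1.5.x command line tools (the file names es-sent.train and es-sent.bin
here are made up; the training file is assumed to be UTF-8 text with one
sentence per line):

```shell
# Train a Spanish sentence detector from a UTF-8 training file,
# then the resulting es-sent.bin can be passed to SentenceDetector.
opennlp SentenceDetectorTrainer -encoding UTF-8 -lang es \
    -data es-sent.train -model es-sent.bin
```

Run "opennlp SentenceDetectorTrainer" without arguments to see the exact
parameters your version expects, as they have changed between releases.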
- Second, what is the correct way of invoking SentenceDetector with UTF-8
support?
I created a sentdetect model for Spanish based on the CoNLL-02 data, as
suggested in a post**. I used the "-encoding UTF-8" argument (and the
input text is UTF-8 as well)***. However, the output file is ASCII text
and shows "??" instead of vowels with diacritics (like á, é, í, ó, ú) ****.
OpenNLP will decode the input file with the specified encoding. However,
the text then has to be encoded again to be written to the console, and
as far as I recall the platform default encoding is used for that. The
details of how that works might also differ from platform to platform.
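A minimal, self-contained sketch of what happens at that second encoding
step (plain JDK, no OpenNLP involved): if the output charset cannot
represent a character, Java substitutes '?', which is exactly the "??"
effect you are seeing.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode a string the way a console/file writer with the given
    // charset would, then decode it back to see what survived.
    static String roundTrip(String text, java.nio.charset.Charset cs) {
        return new String(text.getBytes(cs), cs);
    }

    public static void main(String[] args) throws Exception {
        String text = "canción"; // Spanish text with a diacritic

        // UTF-8 can represent 'ó', so the text survives unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));   // canción

        // US-ASCII cannot represent 'ó'; getBytes() replaces it
        // with '?', which is the corruption seen in the output.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // canci?n

        // One way to force UTF-8 output regardless of the platform
        // default encoding:
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(text);
    }
}
```

Another common workaround is to start the JVM with
-Dfile.encoding=UTF-8 so the platform default itself is UTF-8.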
Which OS are you running? I usually run the command line tools on Linux
with UTF-8 as the default encoding, and that combination never seems to
output wrong characters.
I once saw a similar problem on Windows with Japanese text; the output
consisted mostly of question mark characters.
HTH,
Jörn