I am an NLP researcher trying to use the OpenNLP resources for Spanish,
and I have run into some issues.

- First, why are the Spanish resources not maintained across versions? For
instance, versions 1.3 and 1.4 provide trained postag, sentdetect and
tokenize binaries:

http://maven.tamingtext.com/opennlp-models/models-1.3/spanish/
http://maven.tamingtext.com/opennlp-models/models-1.4/spanish/

while 1.5 provides only es-ner models for Spanish:

http://maven.tamingtext.com/opennlp-models/models-1.5/

I tried to use the 1.4 SentenceDetector model with OpenNLP 1.5.3 and it
fails to load. Apparently, the binary formats are different*.

- Second, what is the correct way to invoke SentenceDetector with UTF-8
support?

I trained a Spanish sentdetect model on the CoNLL-2002 data, as suggested
in a post**. I passed the "-encoding UTF-8" argument (and the input text is
in UTF-8 as well)***. However, the output file is detected as ASCII text and
shows "??" in place of accented vowels (such as á, é, í, ó, ú)****.

Could you help me, please?


* amolina@server:~/apache-opennlp-1.5.3$ opennlp SentenceDetector models/SpanishSent.bin < ../Corpus/text.txt
Loading Sentence Detector model ... Exception in thread "main" java.lang.NullPointerException
    at opennlp.tools.util.model.BaseModel.getManifestProperty(BaseModel.java:491)
    at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:245)
    at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:237)
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:181)
    at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:95)
    at opennlp.tools.cmdline.sentdetect.SentenceModelLoader.loadModel(SentenceModelLoader.java:41)
    at opennlp.tools.cmdline.sentdetect.SentenceModelLoader.loadModel(SentenceModelLoader.java:32)
    at opennlp.tools.cmdline.ModelLoader.load(ModelLoader.java:62)
    at opennlp.tools.cmdline.sentdetect.SentenceDetectorTool.run(SentenceDetectorTool.java:58)
    at opennlp.tools.cmdline.CLI.main(CLI.java:225)
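
The trace fails inside getManifestProperty, which suggests the 1.5.3 loader
expects a manifest that the 1.4 file does not have. As far as I know, 1.5.x
model packages are zip archives carrying a "manifest.properties" entry, but
that layout is an assumption here, not something confirmed by the trace. A
minimal sketch, using only the Java standard library (ModelManifestCheck and
hasManifest are hypothetical names, not part of OpenNLP), to check whether a
given model file has that entry:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ModelManifestCheck {

    // Returns true if the file can be read as a zip archive that contains a
    // "manifest.properties" entry (the ASSUMED 1.5.x model package layout).
    static boolean hasManifest(Path modelFile) throws IOException {
        try (InputStream in = Files.newInputStream(modelFile);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null when no (further) zip entry is found
            while ((entry = zip.getNextEntry()) != null) {
                if ("manifest.properties".equals(entry.getName())) {
                    return true;
                }
            }
        }
        // Either not a zip archive, or a zip without the manifest entry.
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ModelManifestCheck models/SpanishSent.bin models/esp-sent.bin
        for (String arg : args) {
            boolean found = hasManifest(Paths.get(arg));
            System.out.println(arg + ": manifest.properties "
                    + (found ? "found" : "missing"));
        }
    }
}
```

If the 1.4 file reports "missing" while a model trained with 1.5.3 reports
"found", that would be consistent with the two versions using incompatible
package formats.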

**
http://mail-archives.apache.org/mod_mbox/opennlp-dev/201202.mbox/%[email protected]%3E

*** amolina@server:~$ opennlp SentenceDetectorTrainer -model esp-sent.bin -lang es -data traindata-opennlp/esp-sent-UTF-8.train -encoding UTF-8 -iterations 1000
1:  ... loglikelihood=-6970.98119489168    0.718106791289649
...
1000:  ... loglikelihood=-115.51224352128054    0.9976136024659441
Writing sentence detector model ... done (0.067s)

**** amolina@server:~/apache-opennlp-1.5.3$ opennlp SentenceDetector models/esp-sent.bin < ../Corpus/text.txt > outest
Loading Sentence Detector model ... done (0.037s)

amolina@server:~/apache-opennlp-1.5.3$ file outest
outest: ASCII text, with very long lines

amolina@server:~/apache-opennlp-1.5.3$ cat outest
La cerveza negra (en alem??n Schwarzbier) es un tipo de cerveza lager
alemana opaca, de color muy oscuro y sabor fuerte que recuerda al chocolate
o al caf??. Aunque tienen un sabor parecido, son m??s suaves y menos
amargas que las stouts o porters brit??nicas, debido al uso de levadura
lager en lugar de ale y a la omisi??n de la cebada.
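
Each accented vowel above became exactly two "?" characters, which matches
the two bytes that UTF-8 uses for those letters. That pattern would be
consistent with the tool reading standard input and writing standard output
with an ASCII platform default charset, though whether the SentenceDetector
command actually does so is an assumption. A minimal sketch, using only the
Java standard library and not OpenNLP itself (AsciiDecodeDemo and roundTrip
are hypothetical names), that reproduces the symptom:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiDecodeDemo {

    // Decodes UTF-8 input bytes with the given charset and re-encodes the
    // result with the same charset, as a program relying on one platform
    // default charset for both input and output would do.
    static String roundTrip(String text, Charset assumedDefault) {
        byte[] utf8Input = text.getBytes(StandardCharsets.UTF_8);
        // Bytes that are invalid in the charset become U+FFFD on decoding
        String decoded = new String(utf8Input, assumedDefault);
        // Characters the charset cannot encode become '?' on encoding
        byte[] output = decoded.getBytes(assumedDefault);
        // Interpret the output bytes the way a UTF-8 terminal would
        return new String(output, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself pure ASCII
        String sample = "alem\u00e1n caf\u00e9";
        // ASCII default: prints alem??n caf??
        System.out.println(roundTrip(sample, StandardCharsets.US_ASCII));
        // UTF-8 default: the text survives unchanged, prints true
        System.out.println(roundTrip(sample, StandardCharsets.UTF_8).equals(sample));
    }
}
```

If that assumption holds, the "-encoding UTF-8" argument given to the trainer
would only affect how the training data was read, and running the detector in
a JVM whose default charset is UTF-8 would be the thing to try.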



-- 
Alejandro Molina <http://molina.talne.eu>,

Laboratoire Informatique d'Avignon / Université d'Avignon
339 chemin des Meinajaries, Agroparc BP 1228, 84911 Avignon cedex 9,  FRANCE
Tél : (+33) 06 65 12 02 74
