I am a researcher in NLP and I am trying to use the opennlp resources for Spanish. I have run into two issues.

- First, why are the Spanish resources not maintained across versions? For instance, versions 1.3 and 1.4 provide trained postag, sentdetect and tokenize bins:

  http://maven.tamingtext.com/opennlp-models/models-1.3/spanish/
  http://maven.tamingtext.com/opennlp-models/models-1.4/spanish/

  while in 1.5 the only Spanish support is es-ner:

  http://maven.tamingtext.com/opennlp-models/models-1.5/

  I tried to use the 1.4 version of the SentenceDetector model with opennlp version 1.5.3 and it does not work. Apparently, the binaries are different*.

- Second, what is the correct way of invoking SentenceDetector with UTF-8 support? I created a sentdetect model for Spanish based on conll02, as suggested in a post**. I used the "-encoding UTF-8" argument (and the input text is in UTF-8 as well)***. However, the output file is ASCII text and shows "??" instead of the accented vowels (á, é, í, ó, ú)****.

Could you help me, please?

* amolina@server:~/apache-opennlp-1.5.3$ opennlp SentenceDetector models/SpanishSent.bin < ../Corpus/text.txt
Loading Sentence Detector model ...
Exception in thread "main" java.lang.NullPointerException
    at opennlp.tools.util.model.BaseModel.getManifestProperty(BaseModel.java:491)
    at opennlp.tools.util.model.BaseModel.initializeFactory(BaseModel.java:245)
    at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:237)
    at opennlp.tools.util.model.BaseModel.<init>(BaseModel.java:181)
    at opennlp.tools.sentdetect.SentenceModel.<init>(SentenceModel.java:95)
    at opennlp.tools.cmdline.sentdetect.SentenceModelLoader.loadModel(SentenceModelLoader.java:41)
    at opennlp.tools.cmdline.sentdetect.SentenceModelLoader.loadModel(SentenceModelLoader.java:32)
    at opennlp.tools.cmdline.ModelLoader.load(ModelLoader.java:62)
    at opennlp.tools.cmdline.sentdetect.SentenceDetectorTool.run(SentenceDetectorTool.java:58)
    at opennlp.tools.cmdline.CLI.main(CLI.java:225)

** http://mail-archives.apache.org/mod_mbox/opennlp-dev/201202.mbox/%[email protected]%3E

*** amolina@server:~$ opennlp SentenceDetectorTrainer -model esp-sent.bin -lang es -data traindata-opennlp/esp-sent-UTF-8.train -encoding UTF-8 -iterations 1000
1: ... loglikelihood=-6970.98119489168 0.718106791289649
...
1000: ... loglikelihood=-115.51224352128054 0.9976136024659441
Writing sentence detector model ... done (0.067s)

**** amolina@server:~/apache-opennlp-1.5.3$ opennlp SentenceDetector models/esp-sent.bin < ../Corpus/text.txt > outest
Loading Sentence Detector model ... done (0.037s)
amolina@server:~/apache-opennlp-1.5.3$ file outest
outest: ASCII text, with very long lines
amolina@server:~/apache-opennlp-1.5.3$ cat outest
La cerveza negra (en alem??n Schwarzbier) es un tipo de cerveza lager alemana opaca, de color muy oscuro y sabor fuerte que recuerda al chocolate o al caf??.
Aunque tienen un sabor parecido, son m??s suaves y menos amargas que las stouts o porters brit??nicas, debido al uso de levadura lager en lugar de ale y a la omisi??n de la cebada.
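
For what it is worth, the two question marks per accented vowel look consistent with UTF-8 bytes being decoded as ASCII somewhere between stdin and stdout. I have not confirmed that this is what happens inside the opennlp command-line tool. The following is only a minimal sketch of that suspected mechanism using the standard Java library; the class and method names (AsciiRoundTrip, throughAscii) are hypothetical and not part of OpenNLP.

```java
import java.nio.charset.StandardCharsets;

class AsciiRoundTrip {
    // Assumption: the JVM default charset is US-ASCII (e.g. a POSIX/C locale),
    // so UTF-8 input bytes are decoded and re-encoded as ASCII.
    static String throughAscii(String text) {
        // The input as it sits on disk: "á" becomes the two bytes C3 A1.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // Decoding as ASCII turns every byte above 0x7F into U+FFFD.
        String misread = new String(utf8, StandardCharsets.US_ASCII);
        // Encoding back to ASCII turns every U+FFFD into '?'.
        byte[] out = misread.getBytes(StandardCharsets.US_ASCII);
        return new String(out, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // \u00e1 is "á"; the escape avoids depending on the source file encoding.
        System.out.println(throughAscii("alem\u00e1n")); // prints alem??n
    }
}
```

If this is indeed the cause, running the tool under a UTF-8 locale might change the JVM default charset and therefore the output, but I have not verified this on the server above.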

--
Alejandro Molina <http://molina.talne.eu>
Laboratoire Informatique d'Avignon / Université d'Avignon
339 chemin des Meinajaries, Agroparc BP 1228, 84911 Avignon cedex 9, FRANCE
Tél : (+33) 06 65 12 02 74
