Fwd: openNLP best practices - sentence detector

Gabriele Vaccari Thu, 07 Dec 2017 11:35:17 -0800

Hi all,
I sent this message to the users mailing list but got no response so far. 
Reposting to the dev mailing list.

Also: I'm trying to make some modifications to the code for issue
1163<https://issues.apache.org/jira/browse/OPENNLP-1163> mentioned below, but
I'm having trouble with the style checker. I keep getting a lot of
NewlineAtEndOfFile errors, even though the files do have a newline at the end
of the file. I've also made sure to replace every \r\n with \n, to no avail.
I'm using Maven 3.3.9 and Eclipse Neon.2.
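
One way to rule out the files themselves is to look at their final bytes. The following is only a minimal sketch: the class and method names are hypothetical and it is not part of OpenNLP or Checkstyle. It may also be worth checking how the NewlineAtEndOfFile check is configured, since as far as I know it has a lineSeparator property that in some Checkstyle versions defaults to the platform separator, in which case a file ending in a bare \n could still be flagged on Windows.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of OpenNLP or Checkstyle.
class LineEndingCheck {

    // Classifies how the given bytes end: "CRLF", "LF", "CR" or "NONE".
    static String trailingLineEnding(byte[] content) {
        int n = content.length;
        if (n >= 2 && content[n - 2] == '\r' && content[n - 1] == '\n') {
            return "CRLF";
        }
        if (n >= 1 && content[n - 1] == '\n') {
            return "LF";
        }
        if (n >= 1 && content[n - 1] == '\r') {
            return "CR";
        }
        return "NONE";
    }

    // Prints the trailing line ending of every file passed on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(file + ": " + trailingLineEnding(Files.readAllBytes(file)));
        }
    }
}
```

Running it against the files that Checkstyle complains about shows whether they really end in LF, CRLF, or nothing at all, which can then be compared with what the check expects.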

Thank you

Gabriele

From: Gabriele Vaccari
Sent: Friday, December 1, 2017 13:02
To: '[email protected]' <[email protected]>
Subject: openNLP best practices - sentence detector

Hi all,

I'm trying to use openNLP to train some models for Italian, basically to get
some familiarity with the API. To provide some background, I'm familiar with
machine learning concepts and understand what an NLP pipeline looks like.
However, this is the first time I have actually had to put together an
application with all of this.

So I started with the sentence detector. I was able to train an Italian SD with
a corpus of sentences from http://www.corpusitaliano.it/en/. However, the
performance of the detector is somewhat below my expectations. It makes pretty
obvious mistakes, like failing to recognize an end-of-sentence full stop
(example below*), or failing to spot an abbreviation preceded by punctuation
(I've filed issue 1163 on
Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).

Even though the documentation is very good, I feel it lacks some best practices
and suggestions. For instance:

* Is my sentence detection training set supposed to consist of coherent
documents, or will a bunch of random sentences with a blank line every 20-30
work?
* Do my training examples in openNLP native format need to be formatted in
a special way? Will the algo ignore stuff like extra white spaces or tabs
between words? Do examples with a lot of punctuation, like quotes or
parentheses, somehow affect the outcome?
* How many training examples (or events) are recommended?
* Is it better to provide a case-sensitive abbreviation dictionary or a
case-insensitive one?
* Is issue 1163 a known problem? I think other languages, such as French,
might have the same thing happening.
* Are there examples of complete production-grade data sets in Italian or
other languages that have been successfully used to train openNLP tools?

I believe I could find the answers to most of these questions by just looking
at the code, but perhaps someone who has already been through it could point me
in the right direction. Basically, I'm asking for best practices and pro tips.
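
For reference on the formatting questions: the native sentence detector training format described in the OpenNLP manual is one sentence per line, with an empty line marking a document boundary. Whether extra whitespace actually affects the model is still the open question above. The sketch below simply sidesteps it by collapsing whitespace before writing the file. It is a minimal illustration under that assumption; the class and method names are hypothetical and it does not call OpenNLP at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
class SentenceTrainingData {

    // Trims the sentence and collapses runs of spaces, tabs and line breaks
    // into a single space.
    static String normalize(String sentence) {
        return sentence.trim().replaceAll("\\s+", " ");
    }

    // Renders documents as one sentence per line, with a single empty line
    // between consecutive documents. Empty sentences and documents are skipped.
    static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (List<String> document : documents) {
            List<String> lines = new ArrayList<>();
            for (String sentence : document) {
                String clean = normalize(sentence);
                if (!clean.isEmpty()) {
                    lines.add(clean);
                }
            }
            if (lines.isEmpty()) {
                continue;
            }
            if (!first) {
                out.append('\n'); // empty line = document boundary
            }
            for (String line : lines) {
                out.append(line).append('\n');
            }
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
            Arrays.asList("Prima  frase.", "Seconda\tfrase."),
            Arrays.asList(" Altro documento. "));
        System.out.print(format(documents));
    }
}
```

The resulting text can be saved as UTF-8 and used as the training file; grouping sentences by their source document, rather than inserting a blank line at arbitrary intervals, is a design choice of this sketch and not a confirmed requirement.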

Thank you

* Failure to recognize an EOS full stop (the detector returned the two segments
below; SENT_2 should have been split after "essi."):
SENT_1: Molteplici furono i passi che portarono alla nascita di questa
disciplina.
SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è
l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel
1623, grazie a Willhelm Sickhart, si arrivò a creare macchine in grado di
effettuare calcoli matematici con numeri fino a sei cifre, anche se non in
maniera autonoma.

Gabriele Vaccari
Dedalus SpA
