Hi all,
I sent this message to the users mailing list but have had no response so far, 
so I'm reposting it to the dev mailing list.

Also: I'm trying to make some modifications to the code relating to issue 
1163<https://issues.apache.org/jira/browse/OPENNLP-1163> mentioned below, but 
I'm having trouble with the style checker. I keep getting a lot of 
NewlineAtEndOfFile errors, even though the files do have a newline at the end 
of the file. I've also made sure to replace \r\n's with \n's, to no avail. I'm 
using Maven 3.3.9 and Eclipse Neon.2.
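
To rule out the editor hiding what is really on disk, the trailing bytes of a 
file can be inspected directly. The following is only a minimal sketch: the 
class name EofNewlineCheck is hypothetical, and the idea that the check might 
be comparing against the platform line separator (\r\n on Windows) is an 
assumption to verify against the project's Checkstyle configuration, not a 
confirmed diagnosis.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP or Checkstyle.
class EofNewlineCheck {

    /** Classifies the last bytes of a file as "CRLF", "LF", "CR" or "NONE". */
    static String trailingLineEnding(byte[] bytes) {
        int n = bytes.length;
        if (n >= 2 && bytes[n - 2] == '\r' && bytes[n - 1] == '\n') {
            return "CRLF";
        }
        if (n >= 1 && bytes[n - 1] == '\n') {
            return "LF";
        }
        if (n >= 1 && bytes[n - 1] == '\r') {
            return "CR";
        }
        return "NONE";
    }

    public static void main(String[] args) throws IOException {
        // Escape the platform separator so it is visible when printed.
        String sep = System.lineSeparator()
                .replace("\r", "\\r")
                .replace("\n", "\\n");
        System.out.println("platform line separator: " + sep);

        // Report the trailing line ending of every file given as an argument.
        for (String arg : args) {
            Path path = Paths.get(arg);
            byte[] bytes = Files.readAllBytes(path);
            System.out.println(path + ": " + trailingLineEnding(bytes));
        }
    }
}
```

Running it on one of the flagged files shows whether the file ends in LF while 
the platform separator is \r\n, which would be one possible explanation for 
the errors.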

Thank you

Gabriele


From: Gabriele Vaccari
Sent: Friday, December 1, 2017 13:02
To: 'us...@opennlp.apache.org' <us...@opennlp.apache.org>
Subject: openNLP best practices - sentence detector

Hi all,

I'm trying to use openNLP to train some models for Italian, basically to get 
some familiarity with the API. To provide some background, I'm familiar with 
machine learning concepts and understand what an NLP pipeline looks like, 
but this is the first time I've actually had to put together an application 
with all this.

So I started with the sentence detector. I was able to train an Italian SD with 
a corpus of sentences from http://www.corpusitaliano.it/en/. However, the 
performance of the detector is somewhat below my expectations. It makes pretty 
obvious mistakes, like failing to recognize an end-of-sentence full stop 
(example below*), or failing to spot an abbreviation preceded by punctuation 
(I've opened issue 1163 on 
Jira<https://issues.apache.org/jira/browse/OPENNLP-1163> for this).

Even though the documentation is very good, I feel it lacks some best practices 
and suggestions. For instance:

  *   Is my sentence detection training set supposed to consist of coherent 
      documents, or will a bunch of random sentences with a blank line every 
      20-30 sentences work?
  *   Do my training examples in openNLP native format need to be formatted in 
      a special way? Will the algorithm ignore things like extra white space 
      or tabs between words? Do examples with a lot of punctuation, like 
      quotes or parentheses, somehow affect the outcome?
  *   How many training examples (or events) are recommended?
  *   Is it better to provide a case-sensitive abbreviation dictionary or a 
      case-insensitive one?
  *   Is issue 1163 a known problem? I think other languages, such as French, 
      might have the same thing happening.
  *   Are there examples of complete production-grade data sets in Italian or 
      other languages that have been successfully used to train openNLP tools?
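
To make the formatting questions concrete, here is a minimal sketch of the 
kind of preprocessing they refer to. It assumes the native training format as 
I understand it from the documentation: one sentence per line, with an empty 
line marking a document boundary. The class name 
SentenceTrainingDataFormatter is hypothetical, and collapsing runs of spaces 
and tabs is a preprocessing choice made here for illustration; whether the 
trainer needs it is exactly what is being asked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the OpenNLP API.
class SentenceTrainingDataFormatter {

    /**
     * Writes each document as one sentence per line and separates documents
     * with an empty line. Runs of whitespace (spaces, tabs) inside a sentence
     * are collapsed to a single space; empty sentences and empty documents
     * are skipped.
     */
    static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (List<String> document : documents) {
            List<String> lines = new ArrayList<>();
            for (String sentence : document) {
                String normalized = sentence.trim().replaceAll("\\s+", " ");
                if (!normalized.isEmpty()) {
                    lines.add(normalized);
                }
            }
            if (lines.isEmpty()) {
                continue;
            }
            if (!first) {
                out.append('\n'); // empty line = document boundary
            }
            first = false;
            for (String line : lines) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two toy documents with stray tabs and double spaces.
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("Prima  frase.", "Seconda\tfrase. "),
                Arrays.asList("Un altro documento."));
        System.out.print(format(documents));
    }
}
```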

I believe I could find the answers to most of these questions by just looking 
at the code, but someone who has already been through it could perhaps point me 
in the right direction.
Basically, I'm asking for best practices and pro tips.

Thank you

* failure to recognize EOS full stop:
SENT_1: Molteplici furono i passi che portarono alla nascita di questa 
disciplina.
SENT_2: Il primo, sia a livello di importanza che di ordine cronologico, è 
l'avvento dei calcolatori ed il continuo interesse rivolto ad essi. Già nel 
1623, grazie a Willhelm 
Sickhart<https://it.wikipedia.org/w/index.php?title=Willhelm_Sickhart&action=edit&redlink=1>,
 si arrivò a creare macchine in grado di effettuare calcoli matematici con 
numeri fino a sei cifre, anche se non in maniera autonoma.


Gabriele Vaccari
Dedalus SpA
