Gabriele Vaccari created OPENNLP-1163:
-----------------------------------------
Summary: Sentence detector doesn't spot abbreviations next to punctuation
Key: OPENNLP-1163
URL: https://issues.apache.org/jira/browse/OPENNLP-1163
Project: OpenNLP
Issue Type: Bug
Components: Sentence Detector
Affects Versions: 1.8.3
Environment: Reproduced on Windows 10
Reporter: Gabriele Vaccari
Priority: Critical
Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
A Sentence Detector trained with an abbreviations list (see attachment) fails
to recognize those abbreviations in a text when they are immediately preceded
by a punctuation mark.
In Italian, words starting with a vowel may be preceded by an article plus
apostrophe sign (single quote). Example: L'ARTICOLO (the article). The term
ARTICOLO, especially in legal text, is frequently abbreviated to ART.
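To make the pattern concrete, here is a minimal Python sketch of the elision described above. It is an illustration only and not OpenNLP code: the function name strip_elided_article and the set of apostrophe characters are assumptions introduced for this example.

```python
# Illustration only: not part of OpenNLP.
# Assumed apostrophe characters: ASCII apostrophe and right single quote.
APOSTROPHES = ("'", "\u2019")


def strip_elided_article(token: str) -> str:
    """Return the part of the token after its last apostrophe, if it has one."""
    cut = max(token.rfind(a) for a in APOSTROPHES)
    return token[cut + 1:] if cut >= 0 else token


for token in ("L'ARTICOLO", "dell'art.", "art."):
    print(token, "->", strip_elided_article(token))
```

With the article removed, "dell'art." reduces to "art.", which is the form listed in the abbreviations file.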
Repro steps:
1) Add the abbreviation ART. to the abbreviations XML file (attached; search
for "art.").
2) Train a model for the Italian language (training set attached) with the
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) Run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt
Even though the abbreviation "art." is included in the XML file, the sentence
detector breaks the sentence wherever this abbreviation is preceded by an
article and an apostrophe (e.g. nell'art., dall'art., dell'art.). See also the
attached output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
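A small script can help locate such breaks in a detector output file. The following is a minimal sketch, assuming the output contains one detected sentence per line. The helper find_suspect_breaks is hypothetical and not part of OpenNLP, and the sample lines are made up for illustration; they are not taken from out.txt.

```python
import re


# Hypothetical helper, not part of OpenNLP: reports pairs of consecutive
# output lines where the first line ends with an apostrophe-prefixed
# abbreviation, which suggests a spurious sentence break.
def find_suspect_breaks(lines, abbreviations=("art.",)):
    alternatives = "|".join(re.escape(a) for a in abbreviations)
    pattern = re.compile(r"\w+['’](?:" + alternatives + r")\s*$", re.IGNORECASE)
    suspects = []
    for number, line in enumerate(lines, start=1):
        # Only report a break if another line follows the suspect one.
        if pattern.search(line) and number < len(lines):
            suspects.append((number, number + 1))
    return suspects


# Made-up sample, assumed to mimic one sentence per line of detector output.
sample = [
    "Come previsto dall'art.",
    "5 del decreto, il termine decorre dalla notifica.",
    "Il ricorso è stato respinto.",
]
print(find_suspect_breaks(sample))
```

To apply it to a real file, the lines could be read with open("out.txt", encoding="utf-8").read().splitlines() and passed to the function.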