Gabriele Vaccari created OPENNLP-1163:
-----------------------------------------

             Summary: Sentence detector doesn't spot abbreviations next to punctuation
                 Key: OPENNLP-1163
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1163
             Project: OpenNLP
          Issue Type: Bug
          Components: Sentence Detector
    Affects Versions: 1.8.3
         Environment: Reproduced on Windows 10
            Reporter: Gabriele Vaccari
            Priority: Critical
         Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt

A Sentence Detector model trained with an abbreviation dictionary (see the 
attached it-abbr.txt) fails to recognize those abbreviations in a text when 
they are immediately preceded by a punctuation mark.

In Italian, a word starting with a vowel may be preceded by an elided article 
joined to it with an apostrophe (single quote). Example: L'ARTICOLO (the 
article). The term ARTICOLO is frequently abbreviated to ART., especially in 
legal texts.

Repro steps:
1) Add the abbreviation "art." to the abbreviations XML file (attached as 
it-abbr.txt; search for "art.").
2) Train a model for the Italian language (training set attached) with the 
following command:
opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
3) Run the model against a test text with the following command:
opennlp SentenceDetector it-sen.bin < test.txt

Even though the abbreviation "art." is included in the XML file, the sentence 
detector breaks the sentence after every instance of this abbreviation that is 
preceded by an article and an apostrophe (e.g. nell'art., dall'art., 
dell'art.). See also the attached output file out.txt, lines 6-7, 12-13, 13-14 
and 16-17.
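
To make the wrong breaks easier to spot in a longer output file, a small helper along these lines could flag every detected sentence that ends with an apostrophe followed by a known abbreviation. This is only a sketch for inspecting the detector output, not part of OpenNLP: the class name AbbreviationBreakCheck, the method suspectBreaks and the sample sentences are invented for illustration, and the abbreviation set is assumed to be lower-case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not an OpenNLP class): scans detector output, one
// sentence per entry, for sentences that end right after an abbreviation
// attached to an elided article, e.g. "dell'art.".
class AbbreviationBreakCheck {

    // Returns the 0-based indexes of the sentences whose last token has the
    // form <prefix>'<abbreviation>. The abbreviations are assumed lower-case.
    static List<Integer> suspectBreaks(List<String> sentences, Set<String> abbreviations) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            String sentence = sentences.get(i).trim();
            if (sentence.isEmpty()) {
                continue; // nothing to inspect on blank lines
            }
            String[] tokens = sentence.split("\\s+");
            String last = tokens[tokens.length - 1].toLowerCase(Locale.ROOT);
            // Accept both the ASCII apostrophe and the typographic one.
            int apostrophe = Math.max(last.lastIndexOf('\''), last.lastIndexOf('\u2019'));
            if (apostrophe >= 0 && abbreviations.contains(last.substring(apostrophe + 1))) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines that mimic the reported symptom.
        List<String> sentences = Arrays.asList(
                "Come previsto dall'art.",
                "5 della legge.",
                "Si veda l'articolo 3.");
        Set<String> abbreviations = new HashSet<>(Arrays.asList("art."));
        System.out.println(suspectBreaks(sentences, abbreviations)); // prints [0]
    }
}
```

Feeding the lines of out.txt and the entries of the abbreviation dictionary to suspectBreaks should list the sentences that were cut at an abbreviation such as dell'art., which can then be compared with the line pairs listed above.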

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
