[ 
https://issues.apache.org/jira/browse/OPENNLP-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Zowalla updated OPENNLP-1163:
-------------------------------------
    Fix Version/s:     (was: 2.1.2)

> Sentence detector doesn't spot abbreviations next to punctuation
> ----------------------------------------------------------------
>
>                 Key: OPENNLP-1163
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1163
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 1.8.3
>         Environment: Reproduced on Windows 10
>            Reporter: Gabriele Vaccari
>            Priority: Critical
>              Labels: abbreviation, sentence-detector
>         Attachments: it-abbr.txt, out.txt, test.txt, training-set.txt
>
>
> A Sentence Detector model trained with an abbreviations list (see attachment) 
> fails to recognize those abbreviations in a text when they are immediately 
> preceded by a punctuation mark. In Italian, a word starting with a vowel may 
> be preceded by an elided article ending in an apostrophe (single quote). 
> Example: L'ARTICOLO (the article). The term ARTICOLO, especially in legal 
> text, is frequently abbreviated to ART.
> Repro steps:
> 1) Add the "art." abbreviation to the abbreviations XML file (enclosed; 
> search for "art.", case-insensitive).
> 2) Train a model for the Italian language (training set enclosed) with the 
> following command:
> opennlp SentenceDetectorTrainer -abbDict "it-abbr.txt" -lang it -model it-sen.bin -data training-set.txt -encoding UTF-8
> 3) Run the model against a test text with the following command:
> opennlp SentenceDetector it-sen.bin < test.txt
> Even though the abbreviation "art." was included in the XML file, the 
> sentence detector breaks the sentence on instances of this abbreviation 
> preceded by article and apostrophe (e.g. nell'art., dall'art., dell'art.). 
> See also the enclosed output file out.txt, lines 6-7, 12-13, 13-14 and 16-17.
> The issue isn't observed if the apostrophe (single quote) is replaced by a 
> space character.
>  
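
The report notes that the problem disappears when the apostrophe is replaced by a space. As a possible stopgap until the detector itself is fixed, that observation can be applied as a preprocessing step. The following is a minimal sketch, not part of OpenNLP: the function name and the in-code abbreviation list are hypothetical (the real entries live in the attached it-abbr.txt), and it assumes that altering the input text in this way is acceptable.

```python
import re

# Hypothetical subset of the abbreviation list; the real one is it-abbr.txt.
ABBREVIATIONS = ["art."]


def space_out_elided_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Replace the apostrophe directly before a known abbreviation with a space.

    Only apostrophes that follow a word character and are immediately
    followed by a listed abbreviation are touched, so "dell'articolo"
    stays as it is while "dell'art." becomes "dell art.".
    """
    if not abbreviations:
        return text
    # Longest entries first, so a longer abbreviation is tried before a shorter one.
    ordered = sorted(abbreviations, key=len, reverse=True)
    alternatives = "|".join(re.escape(a) for a in ordered)
    # Match both the ASCII apostrophe and the typographic one (U+2019).
    pattern = re.compile(
        r"(?<=\w)" + "['\u2019]" + "(?=" + alternatives + ")",
        re.IGNORECASE,
    )
    return pattern.sub(" ", text)


if __name__ == "__main__":
    import sys

    # Read text on stdin and write the preprocessed text to stdout.
    sys.stdout.write(space_out_elided_abbreviations(sys.stdin.read()))
```

Because one character is swapped for one character, the text length and all character offsets are unchanged. If the sketch were saved under a hypothetical name such as preprocess.py, its output could be piped into the command from step 3, for example `python preprocess.py < test.txt | opennlp SentenceDetector it-sen.bin`.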



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
