[ https://issues.apache.org/jira/browse/OPENNLP-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Martin Wiesner updated OPENNLP-1767:
------------------------------------
    Description:
Currently, sentence detection works incorrectly when an abbreviation dictionary containing common abbreviations is loaded: if an abbreviation such as "S." (German "Seite", i.e. page) overlaps with the end of a sentence, the actual sentence end is not respected and the subsequent sentence is glued to the previous one. Consequently, the detected sentence boundaries do not match the expected ones.

Examples for the German language:
- "Die Frage wurde gestellt. Sie wurde beantwortet."
- "Es lag am DBMS. Die Performance muss verbessert werden."

A reproducer can easily be constructed via a JUnit test for {{SentenceDetectorMEGermanTest}}.

Note:
Affects all other languages as well. Therefore, the impact is broader and the issue has a higher priority than usual.

  was:
Currently, sentence detection works incorrectly when an abbreviation dictionary containing common abbreviations is loaded: if an abbreviation such as "S." (German "Seite", i.e. page) overlaps with the end of a sentence, the actual sentence end is not respected and the subsequent sentence is glued to the previous one. Consequently, the detected sentence boundaries do not match the expected ones.

Examples for the German language:
- "Die Frage wurde gestellt. Sie wurde beantwortet."
- "Es lag am DBMS. Die Performance muss verbessert werden."

A reproducer can easily be constructed via a JUnit test for {{SentenceDetectorMEGermanTest}}.

Note:
Affects all other languages as well. Therefore, the implications are of a higher priority than usual.
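
The overlap described in the updated description can be illustrated with a small, self-contained sketch. This is not OpenNLP code and does not reflect how {{SentenceDetectorME}} is implemented; it is a toy rule-based splitter written only to show the failure mode. The class name {{AbbreviationOverlapSketch}}, its methods, and the {{strict}} flag are hypothetical. The assumption is that a naive suffix comparison treats the end of "DBMS." as the abbreviation "S.", whereas a comparison against the whole whitespace-delimited token does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in OpenNLP.
final class AbbreviationOverlapSketch {

    // Naive check: the text up to and including the period merely ENDS with an abbreviation.
    public static boolean endsWithAbbreviation(String text, int periodIdx, Set<String> abbreviations) {
        String prefix = text.substring(0, periodIdx + 1);
        for (String abbreviation : abbreviations) {
            if (prefix.endsWith(abbreviation)) {
                return true;
            }
        }
        return false;
    }

    // Stricter check: the whole whitespace-delimited token must equal an abbreviation.
    public static boolean tokenIsAbbreviation(String text, int periodIdx, Set<String> abbreviations) {
        int start = periodIdx;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
        }
        return abbreviations.contains(text.substring(start, periodIdx + 1));
    }

    // Splits at every period that is followed by whitespace or ends the text,
    // unless the period is considered part of an abbreviation.
    public static List<String> split(String text, Set<String> abbreviations, boolean strict) {
        List<String> sentences = new ArrayList<>();
        int sentenceStart = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            boolean atEnd = i == text.length() - 1;
            if (!atEnd && !Character.isWhitespace(text.charAt(i + 1))) {
                continue; // period inside a token, e.g. "z.B."
            }
            boolean abbreviation = strict
                    ? tokenIsAbbreviation(text, i, abbreviations)
                    : endsWithAbbreviation(text, i, abbreviations);
            if (abbreviation && !atEnd) {
                continue; // do not split after an abbreviation in mid-text
            }
            sentences.add(text.substring(sentenceStart, i + 1).trim());
            sentenceStart = i + 1;
        }
        String rest = text.substring(sentenceStart).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest); // trailing text without a final period
        }
        return sentences;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = Set.of("S.");
        String text = "Es lag am DBMS. Die Performance muss verbessert werden.";
        System.out.println(split(text, abbreviations, false)); // naive suffix match
        System.out.println(split(text, abbreviations, true));  // whole-token match
    }
}
```

With the naive suffix match, the second example from the description comes back as a single glued sentence; with the whole-token match it is split into two, while a genuine use such as "Siehe S. 12 im Handbuch." still stays in one piece.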
> Fix sentence detection when an abbreviation overlaps at sentence end
> --------------------------------------------------------------------
>
>                 Key: OPENNLP-1767
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1767
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>   Affects Versions: 2.5.5
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Major
>             Fix For: 2.5.6, 3.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, sentence detection works incorrectly when an abbreviation
> dictionary containing common abbreviations is loaded: if an abbreviation
> such as "S." (German "Seite", i.e. page) overlaps with the end of a
> sentence, the actual sentence end is not respected and the subsequent
> sentence is glued to the previous one. Consequently, the detected sentence
> boundaries do not match the expected ones.
>
> Examples for the German language:
> - "Die Frage wurde gestellt. Sie wurde beantwortet."
> - "Es lag am DBMS. Die Performance muss verbessert werden."
>
> A reproducer can easily be constructed via a JUnit test for
> {{SentenceDetectorMEGermanTest}}.
>
> Note:
> Affects all other languages as well. Therefore, the impact is broader and
> the issue has a higher priority than usual.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)