Martin Wiesner created OPENNLP-1809:
---------------------------------------
Summary: SentenceDetector misses multi-letter abbreviations at
sentence start
Key: OPENNLP-1809
URL: https://issues.apache.org/jira/browse/OPENNLP-1809
Project: OpenNLP
Issue Type: Bug
Components: Sentence Detector
Affects Versions: 3.0.0-M1, 2.5.8
Reporter: Martin Wiesner
Assignee: Martin Wiesner
Fix For: 3.0.0-M2
As a follow-up to OPENNLP-1781, closer inspection with real-world data
revealed that SentenceDetectorME splits text incorrectly when a sentence
starts with a multi-letter abbreviation.
Example text:
"Bek. Problem: Schlafmangel. Über die letzten Tage hinweg zunehmend müde."
Expected: 2 sentences:
* "Bek. Problem: Schlafmangel."
* "Über die letzten Tage hinweg zunehmend müde."
However, 3 sentences are returned, even though "Bek." (short for "Bekanntes",
i.e. "known") has been added to the abbreviation XML file for the German
language.
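
The expected behavior can be illustrated with a minimal, self-contained
sketch. This is NOT the SentenceDetectorME implementation (which is
model-based); it is a simplified rule-based stand-in, and the class name
AbbreviationAwareSplitter and its split method are hypothetical names
introduced only for this illustration. It assumes the abbreviation
dictionary is available as a plain set of tokens that include the trailing
period.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
final class AbbreviationAwareSplitter {

    // Splits at '.', '!' or '?' when followed by whitespace or end of text,
    // unless the whitespace-delimited token ending at that character is a
    // known abbreviation (e.g. "Bek.").
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0; // start offset of the sentence currently being built
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '!' && c != '?') {
                continue;
            }
            boolean atEnd = i + 1 == text.length();
            if (!atEnd && !Character.isWhitespace(text.charAt(i + 1))) {
                continue; // punctuation inside a token, e.g. "3.0"
            }
            // Walk back to the beginning of the token that ends at i.
            int tokenStart = i;
            while (tokenStart > start
                    && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                tokenStart--;
            }
            String token = text.substring(tokenStart, i + 1);
            if (abbreviations.contains(token)) {
                continue; // known abbreviation: no sentence boundary here
            }
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = i + 1;
        }
        // Keep any trailing text that lacks end-of-sentence punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }
}
```

Calling split with the example text and Set.of("Bek.") yields the two
expected sentences. With an empty abbreviation set, the same sketch yields
three, because "Bek." is then split off as a sentence of its own; whether
SentenceDetectorME splits at exactly that position is an assumption, as the
report only states the sentence count.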
Goal:
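The fix shall ensure that a multi-letter abbreviation listed in the
abbreviation file is not treated as a sentence boundary when it occurs at
the start of a sentence, so that the example above yields 2 sentences.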