[
https://issues.apache.org/jira/browse/OPENNLP-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Martin Wiesner updated OPENNLP-1809:
------------------------------------
Description:
As a follow-up to OPENNLP-1781, a closer inspection with real-world data
revealed that _SentenceDetectorME_ does not work as expected when a sentence
starts with a multi-letter abbreviation.
Example text:
"Bek. Problem: Schlafmangel. Über die letzten Tage hinweg war sie zunehmend
müde."
Expected: 2 sentences:
* "Bek. Problem: Schlafmangel."
* "Über die letzten Tage hinweg war sie zunehmend müde."
However, 3 sentences are returned, even though "Bek." (Bekanntes -> "known")
has been added to the abbreviation XML file for the German language.
Goal:
The fix shall resolve this bug.
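To make the expected result concrete, here is a minimal, self-contained sketch of abbreviation-aware splitting applied to the example text. It is not the OpenNLP implementation and does not use the OpenNLP API. The class name AbbreviationAwareSplitter, the split method, and the plain Set of abbreviations are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
public class AbbreviationAwareSplitter {

    // Splits at '.', '!' or '?' when followed by whitespace or end of text,
    // unless the token ending at a '.' is a known abbreviation.
    public static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '!' && c != '?') {
                continue;
            }
            boolean atEnd = i + 1 == text.length();
            if (!atEnd && !Character.isWhitespace(text.charAt(i + 1))) {
                continue; // not a candidate boundary, e.g. "3.14"
            }
            // Find the token that ends at this punctuation mark, e.g. "Bek."
            int tokenStart = i;
            while (tokenStart > start
                    && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                tokenStart--;
            }
            String token = text.substring(tokenStart, i + 1);
            if (c == '.' && abbreviations.contains(token)) {
                continue; // abbreviation, also when it is the first token of a sentence
            }
            sentences.add(text.substring(start, i + 1).trim());
            start = i + 1;
        }
        // Keep any trailing text that has no closing punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding.
        String text = "Bek. Problem: Schlafmangel. "
                + "\u00dcber die letzten Tage hinweg war sie zunehmend m\u00fcde.";
        // No abbreviations known: three segments, as in the reported behaviour.
        split(text, Set.of()).forEach(System.out::println);
        // "Bek." registered as an abbreviation: the two expected sentences.
        split(text, Set.of("Bek.")).forEach(System.out::println);
    }
}
```

With an empty abbreviation set the sketch yields three segments. With "Bek." registered it yields the two expected sentences, which is the outcome the fix should produce in _SentenceDetectorME_.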
> SentenceDetector misses multi-letter abbreviations at sentence start
> --------------------------------------------------------------------
>
> Key: OPENNLP-1809
> URL: https://issues.apache.org/jira/browse/OPENNLP-1809
> Project: OpenNLP
> Issue Type: Bug
> Components: Sentence Detector
> Affects Versions: 2.5.8, 3.0.0-M1
> Reporter: Martin Wiesner
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 3.0.0-M2
--
This message was sent by Atlassian Jira
(v8.20.10#820010)