[ 
https://issues.apache.org/jira/browse/OPENNLP-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Wiesner updated OPENNLP-1809:
------------------------------------
    Description: 
As a follow-up to OPENNLP-1781, a deeper inspection with real-world data 
revealed that SentenceDetectorME does not work as expected when a sentence 
starts with a multi-letter abbreviation.

Example text:

"Bek. Problem: Schlafmangel. Über die letzten Tage hinweg war sie zunehmend 
müde."

Expected: 2 sentences: 
 * "Bek. Problem: Schlafmangel."
 * "Über die letzten Tage hinweg war sie zunehmend müde."

However, 3 sentences are returned, even though "Bek." (Bekanntes -> "known") 
is listed in the abbreviation XML file for the German language.
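
For illustration only, the expected behaviour can be sketched with a small, 
self-contained Java splitter. This is not OpenNLP code and does not reflect 
how SentenceDetectorME works internally; the class name 
AbbreviationAwareSplitter, its split method, and the one-entry abbreviation 
set are hypothetical. It only shows that a period ending a known abbreviation 
should not terminate a sentence, even when that abbreviation is the first 
token of the text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy sketch, NOT the OpenNLP implementation: splits on '.' unless the
// whitespace-delimited token ending at that '.' is a known abbreviation.
class AbbreviationAwareSplitter {

    // Example text from the issue; non-ASCII letters are written as
    // Unicode escapes so the source compiles under any file encoding.
    static final String EXAMPLE =
        "Bek. Problem: Schlafmangel. "
        + "\u00dcber die letzten Tage hinweg war sie zunehmend m\u00fcde.";

    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) != '.') {
                continue;
            }
            // Token that ends at this period, e.g. "Bek." or "Schlafmangel."
            int tokenStart = text.lastIndexOf(' ', i) + 1;
            String token = text.substring(tokenStart, i + 1);
            if (abbreviations.contains(token)) {
                continue; // abbreviation period: not a sentence boundary
            }
            String sentence = text.substring(start, i + 1).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = i + 1;
        }
        // Keep any trailing text that does not end with a period.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical abbreviation set standing in for the German XML file.
        for (String s : split(EXAMPLE, Set.of("Bek."))) {
            System.out.println(s);
        }
    }
}
```

With the abbreviation set containing "Bek.", this sketch returns the two 
expected sentences; with an empty set it splits after "Bek." and returns 
three.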

Goal:
The fix shall resolve this bug.

  was:
As a follow-up of OPENNLP-1781, a deeper inspection with real world data 
revealed that SentenceDetectorME does not work as expected when a sentence 
starts with a multi-letter abbreviation.

Example text:

"Bek. Problem: Schlafmangel. Über die letzten Tage hinweg zunehmend müde."

Expected: 2 sentences: 
 * "Bek. Problem: Schlafmangel."
 * "Über die letzten Tage hinweg war sie zunehmend müde."

However, 3 sentences are returned, even "Bek." (Bekanntes -> "known") is added 
to the abbreviation xml file for the German language.

Goal:
The fix shall resolve this bug.


> SentenceDetector misses multi-letter abbreviations at sentence start
> --------------------------------------------------------------------
>
>                 Key: OPENNLP-1809
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1809
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Sentence Detector
>    Affects Versions: 2.5.8, 3.0.0-M1
>            Reporter: Martin Wiesner
>            Assignee: Martin Wiesner
>            Priority: Major
>             Fix For: 3.0.0-M2
>
>
> As a follow-up to OPENNLP-1781, a deeper inspection with real-world data 
> revealed that SentenceDetectorME does not work as expected when a sentence 
> starts with a multi-letter abbreviation.
> Example text:
> "Bek. Problem: Schlafmangel. Über die letzten Tage hinweg war sie zunehmend 
> müde."
> Expected: 2 sentences: 
>  * "Bek. Problem: Schlafmangel."
>  * "Über die letzten Tage hinweg war sie zunehmend müde."
> However, 3 sentences are returned, even though "Bek." (Bekanntes -> "known") 
> is listed in the abbreviation XML file for the German language.
> Goal:
> The fix shall resolve this bug.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
