Richard Zowalla created OPENNLP-1781:
----------------------------------------

             Summary: SentenceDetectorME throws StringIndexOutOfBoundsException 
when sentence starts with an abbreviation
                 Key: OPENNLP-1781
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1781
             Project: OpenNLP
          Issue Type: Bug
    Affects Versions: 2.5.6
            Reporter: Richard Zowalla
            Assignee: Richard Zowalla
             Fix For: 2.5.7, 3.0.0


When an abbreviation appears at the beginning of a sentence, OpenNLP 2.5.6's 
SentenceDetectorME can throw a java.lang.StringIndexOutOfBoundsException.

This issue can be reproduced with a test like the following:

{code:java}
@Test
void testSentDetectWithAbbreviationsAtSentenceStart() {
  prepareResources(true);

  final String sent1 = "S. Träume sind eine Verbindung von Gedanken.";

  // The input is a single sentence that begins with the abbreviation "S.".
  String[] sents = sentenceDetector.sentDetect(sent1);
  double[] probs = sentenceDetector.probs();

  assertAll(
      () -> assertEquals(1, sents.length),
      () -> assertEquals(sent1, sents[0]),
      () -> assertEquals(1, probs.length));
}{code}

In practice, an abbreviation can appear at the start of a sentence in 
medical text, for example when an ICD-10 code is used.
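
The report does not state the root cause. Purely as an illustration of how 
this class of error can arise, the following stdlib-only sketch looks backward 
from a period for the preceding whitespace. All names here 
(AbbreviationStartSketch, tokenEndingAtUnguarded, tokenEndingAtGuarded) are 
hypothetical and are not part of OpenNLP; the actual code path in 
SentenceDetectorME may differ.

```java
// Hypothetical illustration only; this is NOT OpenNLP's implementation.
final class AbbreviationStartSketch {

    /** Unguarded look-behind: assumes a space always precedes the token ending at dotPos. */
    static String tokenEndingAtUnguarded(String text, int dotPos) {
        int ws = text.lastIndexOf(' ', dotPos); // -1 when the token starts the text
        // Works mid-text, but substring(-1, ...) throws StringIndexOutOfBoundsException.
        return text.substring(ws, dotPos + 1).trim();
    }

    /** Guarded variant: starts right after the space, or at index 0 if there is none. */
    static String tokenEndingAtGuarded(String text, int dotPos) {
        int start = text.lastIndexOf(' ', dotPos) + 1; // 0 when no space precedes
        return text.substring(start, dotPos + 1);
    }

    public static void main(String[] args) {
        // Same sentence as in the test above; \u00e4 is the letter "a" with umlaut.
        String sent = "S. Tr\u00e4ume sind eine Verbindung von Gedanken.";
        System.out.println(tokenEndingAtGuarded(sent, 1));
        try {
            tokenEndingAtUnguarded(sent, 1);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unguarded look-behind failed: " + e.getMessage());
        }
    }
}
```

The guarded variant treats "no preceding whitespace" as "token starts at 
index 0", which is the case an abbreviation at the start of a sentence 
exercises.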



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
