Martin Wiesner created OPENNLP-1555:
---------------------------------------

             Summary: TokenizerME should detect multi-dot abbreviations
                 Key: OPENNLP-1555
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1555
             Project: OpenNLP
          Issue Type: Improvement
          Components: Tokenizer
    Affects Versions: 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.0, 2.1.0
            Reporter: Martin Wiesner
            Assignee: Martin Wiesner
             Fix For: 2.3.4


TokenizerME should detect and handle multi-dot abbreviations. 
Currently, it does not. For instance,

German: "z.B." = "zum Beispiel" (for example), or
Dutch: "e.v." = "en volgende" (and following)

are not tokenized correctly: extra tokens are returned. NOTE: there is no 
whitespace between the dots in the above examples.
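
To make the expected outcome concrete, here is a minimal, self-contained 
sketch in plain Java. It is NOT the TokenizerME implementation and does not 
use the OpenNLP API; the class name, the tokenize helper and the hard-coded 
abbreviation set are hypothetical and exist only for illustration. It assumes 
whitespace-separated input and a known set of abbreviations, and shows that a 
multi-dot abbreviation should come back as one token, while an ordinary 
sentence-final period is still split off.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; not part of OpenNLP.
final class MultiDotAbbreviationSketch {

    // Hypothetical helper: splits on whitespace, keeps known abbreviations
    // intact, and separates a trailing period from any other chunk.
    static List<String> tokenize(String text, Set<String> abbreviations) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.isEmpty()) {
                continue; // blank input yields a single empty chunk
            }
            boolean keepWhole = abbreviations.contains(chunk)
                    || !chunk.endsWith(".")
                    || chunk.length() == 1;
            if (keepWhole) {
                // e.g. "z.B." or "e.v." stays a single token
                tokens.add(chunk);
            } else {
                // e.g. "Test." becomes "Test" and "."
                tokens.add(chunk.substring(0, chunk.length() - 1));
                tokens.add(".");
            }
        }
        return tokens;
    }
}
```

With the assumed set {"z.B.", "e.v."}, the input "Das ist z.B. ein Test." 
yields the tokens Das, ist, z.B., ein, Test and a final period, without any 
extra tokens for the abbreviation.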

Aims:
 * Fix the detection and handling of multi-dot abbreviations
 * Provide test cases that cover these cases



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
