[ https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849906#comment-17849906 ]
ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------

demq commented on code in PR #602:
URL: https://github.com/apache/opennlp/pull/602#discussion_r1616698074


##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java:
##########
@@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() {
     Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", "b", "\r", "\n", "\r", "\n", "c"},
         tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c"));
   }
+
+  /**
+   * Tests if it can tokenize a word containing a non-spacing character
+   * like Arabic Damma Unicode Character “◌ُ” (U+064F)
+   */
+  @Test
+  void testNonSpacingLetters() {
+    String text = "طُوّر";

Review Comment:
   I have just pushed an update with a full sentence.


> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes
> words containing non-spacing letters. For example, the Arabic word "طُوّر"
> gets tokenized to individual letters ["ط", "ُ", "و", "ّ", "ر"].

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
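
To illustrate the behaviour described in the issue, here is a minimal, self-contained sketch of a character-class tokenizer that keeps combining marks attached to the preceding character. It is not the OpenNLP implementation and not necessarily the fix applied in PR #602. The class name NonSpacingMarkSketch, the CharClass enum, and the classify/tokenize helpers are hypothetical names made up for this example. The only assumption about the approach is that a mark (Unicode general category Mn, Mc, or Me) inherits the class of the character before it, so it does not start a new token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of opennlp-tools.
class NonSpacingMarkSketch {

    // Coarse character classes used to decide where one token ends.
    enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    // Classifies c. Assumption: a combining mark takes the class of the
    // preceding non-whitespace character instead of forming its own class.
    static CharClass classify(char c, CharClass previous) {
        int type = Character.getType(c);
        boolean isMark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
        if (isMark && previous != CharClass.WHITESPACE) {
            return previous;
        }
        if (Character.isWhitespace(c)) {
            return CharClass.WHITESPACE;
        }
        if (Character.isLetter(c)) {
            return CharClass.ALPHABETIC;
        }
        if (Character.isDigit(c)) {
            return CharClass.NUMERIC;
        }
        return CharClass.OTHER;
    }

    // Splits text on whitespace and on every change of character class.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the open token, or -1 if none
        CharClass current = CharClass.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            CharClass cls = classify(text.charAt(i), current);
            if (cls == CharClass.WHITESPACE) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            } else if (cls != current || cls == CharClass.OTHER) {
                // Class changed, or a run of OTHER characters: close the token.
                tokens.add(text.substring(start, i));
                start = i;
            }
            current = cls;
        }
        if (start >= 0) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The word from the issue, written with escapes: U+0637 U+064F U+0648 U+0651 U+0631
        System.out.println(tokenize("\u0637\u064F\u0648\u0651\u0631"));
    }
}
```

Without the isMark branch, U+064F and U+0651 are not letters according to Character.isLetter, so they would fall into the OTHER class and break the word apart, which matches the splitting reported above.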