[jira] [Commented] (OPENNLP-1563) SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters

ASF GitHub Bot (Jira) Tue, 28 May 2024 00:04:03 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17849906#comment-17849906
 ]


ASF GitHub Bot commented on OPENNLP-1563:
-----------------------------------------

demq commented on code in PR #602:
URL: https://github.com/apache/opennlp/pull/602#discussion_r1616698074


##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/SimpleTokenizerTest.java:
##########
@@ -128,4 +128,18 @@ void testTokenizationOfStringWithWindowsNewLineTokens() {
     Assertions.assertArrayEquals(new String[] {"a", "\r", "\n", "\r", "\n", "b", "\r", "\n", "\r", "\n", "c"},
         tokenizer.tokenize("a\r\n\r\n b\r\n\r\n c"));
   }
+
+  /**
+   * Tests if it can tokenize a word containing a non-spacing character
+   * like Arabic Damma Unicode Character “◌ُ” (U+064F)
+   */
+  @Test
+  void testNonSpacingLetters() {
+    String text = "طُوّر";

Review Comment:
   I have just pushed an update with a full sentence.





> SimpleTokenizer.tokenizePos incorrectly splits words with non-spacing letters
> -----------------------------------------------------------------------------
>
>                 Key: OPENNLP-1563
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1563
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: 2.3.3
>            Reporter: Hrayr Matevosyan
>            Priority: Major
>
> The tokenizePos implementation of SimpleTokenizer incorrectly tokenizes 
> words containing non-spacing letters. For example, the Arabic word "طُوّر" 
> is split into its individual characters ["ط", "ُ", "و", "ّ", "ر"].
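
To illustrate the report above, here is a minimal sketch of a character-class tokenizer. It is not OpenNLP's SimpleTokenizer code and not necessarily the fix applied in the PR; the class NonSpacingMarkDemo, the CharClass enum, and the marksAsLetters flag are hypothetical names used only for this example. It assumes the split happens because Character.isLetter returns false for non-spacing marks (Unicode category Mn) such as Damma (U+064F) and Shadda (U+0651), and shows that treating such marks as part of the word keeps it in one piece.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of OpenNLP.
final class NonSpacingMarkDemo {

  private enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

  // Classifies a single char. When marksAsLetters is true, non-spacing
  // marks (category Mn) are treated like letters (an assumed remedy).
  private static CharClass classify(char c, boolean marksAsLetters) {
    if (Character.isWhitespace(c)) {
      return CharClass.WHITESPACE;
    }
    if (Character.isLetter(c)) {
      return CharClass.ALPHABETIC;
    }
    if (marksAsLetters && Character.getType(c) == Character.NON_SPACING_MARK) {
      return CharClass.ALPHABETIC;
    }
    if (Character.isDigit(c)) {
      return CharClass.NUMERIC;
    }
    return CharClass.OTHER;
  }

  // Splits text wherever the character class changes; every OTHER char
  // becomes its own token and whitespace is dropped.
  static List<String> tokenize(String text, boolean marksAsLetters) {
    List<String> tokens = new ArrayList<>();
    int start = -1; // start index of the token being built, -1 if none
    CharClass previous = CharClass.WHITESPACE;
    for (int i = 0; i < text.length(); i++) {
      CharClass current = classify(text.charAt(i), marksAsLetters);
      boolean boundary = current != previous || current == CharClass.OTHER;
      if (boundary) {
        if (start >= 0) {
          tokens.add(text.substring(start, i));
        }
        start = (current == CharClass.WHITESPACE) ? -1 : i;
      }
      previous = current;
    }
    if (start >= 0) {
      tokens.add(text.substring(start));
    }
    return tokens;
  }

  public static void main(String[] args) {
    // The word from the report, written with escapes:
    // U+0637, U+064F (Damma), U+0648, U+0651 (Shadda), U+0631.
    String word = "\u0637\u064F\u0648\u0651\u0631";
    System.out.println(tokenize(word, false).size()); // marks split the word
    System.out.println(tokenize(word, true).size());  // marks stay attached
  }
}
```

With marksAsLetters set to false the word comes back as five single-character tokens, matching the behaviour described in the report; with it set to true the word stays a single token.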



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
