[jira] [Commented] (OPENNLP-1479) Write better tests for pattern verification (tokenizers)

ASF GitHub Bot (Jira) Mon, 11 Dec 2023 00:55:36 -0800


    [ https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795246#comment-17795246 ]


ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------

kinow commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1422109419


##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
     Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
   }
 
+  void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+      int expectedNumTokens) throws IOException {
+
+    TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+        Pattern.compile(pattern)));
+
+    TokenizerME tokenizer = new TokenizerME(model);
+    String[] tokens = tokenizer.tokenize(sentence);
+
+    Assertions.assertEquals(expectedNumTokens, tokens.length);
+    String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+    for (int i = 0; i < sentSplit.length; i++) {
+      Assertions.assertEquals(sentSplit[i], tokens[i]);
+    }
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEDeu() throws IOException {
+    String lang = "deu";
+    String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+    String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMEPor() throws IOException {
+    String lang = "por";
+    String pattern = "^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$";
+    String sentence = "Na floresta mágica a raposa dança com unicórnios felizes.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 10);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMESpa() throws IOException {
+    String lang = "spa";
+    String pattern = "^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$";
+    String sentence = "En el verano los niños juegan en el parque y sus risas crean alegría.";
+    checkCustomPatternForTokenizerME(lang, pattern, sentence, 15);
+  }
+
+  @Test
+  void testCustomPatternForTokenizerMECat() throws IOException {
+    String lang = "cat";
+    String pattern = "^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$";
+    String sentence = "Als xiuxiuejants avets el ós blau neda amb cignes i se ho passen bé.";

Review Comment:
   Hi @franra9 ! :wave: 
   
   >the sentence is syntactically correct but "os" does not have tilde since 2016 (see https://esadir.cat/gramatica/criteris/diacritics)
   
   Looking at the link,
   
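   >2. S'escriuen sense accent diacrític la resta de monosíl·labs i qualsevol paraula de més d'una síl·laba. Exemples:
   
   (In English: "All other monosyllables, and any word of more than one syllable, are written without a diacritical accent. Examples:")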
   
   Ah! Today I learned! 
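
   A minimal sketch (not part of the PR) of what the spelling change means for the Catalan pattern in the diff: both "ós" and "os" match the character class, so dropping the accent from the test sentence should not require touching the pattern. The class and method names here are hypothetical, and the pattern is written with Unicode escapes only to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

public class CatalanPatternCheck {
  // Same character class as the "cat" pattern in the diff above,
  // spelled with Unicode escapes instead of literal accented letters.
  static final Pattern CAT = Pattern.compile(
      "^[0-9a-z\u00e0\u00e8\u00e9\u00ed\u00ef\u00f2\u00f3\u00fa\u00fc\u00e7"
      + "A-Z\u00c0\u00c8\u00c9\u00cd\u00cf\u00d2\u00d3\u00da\u00dc\u00c7]+$");

  // Hypothetical helper: true when the whole token consists of
  // characters from the pattern's character class.
  static boolean isAlphanumeric(String token) {
    return CAT.matcher(token).matches();
  }

  public static void main(String[] args) {
    System.out.println(isAlphanumeric("\u00f3s"));  // old spelling with accent: true
    System.out.println(isAlphanumeric("os"));       // current spelling: true
    System.out.println(isAlphanumeric("b\u00e9.")); // trailing period is not in the class: false
  }
}
```

   The last case shows why the sentence-final period still has to be split off by the tokenizer itself: a token with an attached period does not match the pattern.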








> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
>                 Key: OPENNLP-1479
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1479
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Tokenizer
>    Affects Versions: 2.1.1
>            Reporter: Bruno P. Kinoshita
>            Assignee: Lara Marinov
>            Priority: Major
>             Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created 
> correctly (i.e. tests getters and setters, constructor, etc.), without 
> verifying the actual behavior when used in conjunction with other classes 
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different 
> languages with some interesting sample data (maybe something from project 
> gutenberg, open source news sites, etc.).
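
As a rough illustration of the behavioral check requested here, the helper in the PR under review derives its expected tokens by detaching every period and splitting on spaces. The sketch below reproduces only that derivation with the JDK, without OpenNLP; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class ExpectedTokens {
  // Mirrors the expression used by the helper in the diff:
  // put a space before every period, then split on single spaces.
  static String[] expectedTokens(String sentence) {
    return sentence.replaceAll("\\.", " .").split(" ");
  }

  public static void main(String[] args) {
    // German sample sentence from the diff; the escape stands for a-umlaut.
    String deu = "Ich w\u00e4hle den auf S. 183 ff. mitgeteilten Traum von der "
        + "botanischen Monographie.";
    String[] expected = expectedTokens(deu);
    System.out.println(expected.length);           // 16, the count asserted in the test
    System.out.println(Arrays.toString(expected));
  }
}
```

This derivation only holds when periods are the sole punctuation in the sentence, and it treats abbreviations such as "S." and "ff." as two tokens each, which is what the expected count of 16 assumes.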



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
