[
https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17795246#comment-17795246
]
ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------
kinow commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1422109419
##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
}
+ void checkCustomPatternForTokenizerME(String lang, String pattern, String sentence,
+ int expectedNumTokens) throws IOException {
+
+ TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+ Pattern.compile(pattern)));
+
+ TokenizerME tokenizer = new TokenizerME(model);
+ String[] tokens = tokenizer.tokenize(sentence);
+
+ Assertions.assertEquals(expectedNumTokens, tokens.length);
+ String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+ for (int i = 0; i < sentSplit.length; i++) {
+ Assertions.assertEquals(sentSplit[i], tokens[i]);
+ }
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMEDeu() throws IOException {
+ String lang = "deu";
+ String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+ String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der botanischen Monographie.";
+ checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMEPor() throws IOException {
+ String lang = "por";
+ String pattern = "^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$";
+ String sentence = "Na floresta mágica a raposa dança com unicórnios felizes.";
+ checkCustomPatternForTokenizerME(lang, pattern, sentence, 10);
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMESpa() throws IOException {
+ String lang = "spa";
+ String pattern = "^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$";
+ String sentence = "En el verano los niños juegan en el parque y sus risas crean alegría.";
+ checkCustomPatternForTokenizerME(lang, pattern, sentence, 15);
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMECat() throws IOException {
+ String lang = "cat";
+ String pattern = "^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$";
+ String sentence = "Als xiuxiuejants avets el ós blau neda amb cignes i se ho passen bé.";
Review Comment:
Hi @franra9 ! :wave:
>the sentence is syntactically correct but "os" does not have tilde since 2016 (see https://esadir.cat/gramatica/criteris/diacritics)
Looking at the link,
>2. All other monosyllables and any word of more than one syllable are written without a diacritic accent. Examples:
Ah! Today I learned!
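As a rough sanity check on the pattern alone (plain `java.util.regex`, not the OpenNLP API), the sketch below reuses the `cat` character class from the test. The class name `CatalanPatternSketch` and the helper `isAlphaNumeric` are made up for illustration. Both "ós" and "os" fall inside the class, so the spelling change should not affect whether the word counts as alphanumeric for this pattern; it says nothing about what a trained `TokenizerME` model does with the sentence.

```java
import java.util.regex.Pattern;

public class CatalanPatternSketch {

  // Same character class as the "cat" pattern used in the test above.
  static final Pattern CAT_ALPHANUMERIC =
      Pattern.compile("^[0-9a-zàèéíïòóúüçA-ZÀÈÉÍÏÒÓÚÜÇ]+$");

  // Hypothetical helper: true when the whole token is made of characters
  // from the class above (anchors plus matches() require a full match).
  static boolean isAlphaNumeric(String token) {
    return CAT_ALPHANUMERIC.matcher(token).matches();
  }

  public static void main(String[] args) {
    // "bé." contains a full stop, which is outside the character class.
    for (String token : new String[] {"ós", "os", "bé", "bé."}) {
      System.out.println(token + " -> " + isAlphaNumeric(token));
    }
  }
}
```

With or without the accent the word matches, while a token with trailing punctuation such as "bé." does not.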
> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
> Key: OPENNLP-1479
> URL: https://issues.apache.org/jira/browse/OPENNLP-1479
> Project: OpenNLP
> Issue Type: Improvement
> Components: Tokenizer
> Affects Versions: 2.1.1
> Reporter: Bruno P. Kinoshita
> Assignee: Lara Marinov
> Priority: Major
> Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created
> correctly (i.e. tests getters and setters, constructor, etc.), without
> verifying the actual behavior when used in conjunction with other classes
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different
> languages with some interesting sample data (maybe something from project
> gutenberg, open source news sites, etc.).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)