[
https://issues.apache.org/jira/browse/OPENNLP-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794626#comment-17794626
]
ASF GitHub Bot commented on OPENNLP-1479:
-----------------------------------------
mawiesne commented on code in PR #559:
URL: https://github.com/apache/opennlp/pull/559#discussion_r1420243259
##########
opennlp-tools/src/test/java/opennlp/tools/tokenize/TokenizerFactoryTest.java:
##########
@@ -163,6 +163,106 @@ void testCustomPatternAndAlphaOpt() throws IOException {
Assertions.assertTrue(factory.isUseAlphaNumericOptimization());
}
+ void checkCustomPatternForTokenizerME(String lang, String pattern, String
sentence,
+ int expectedNumTokens) throws IOException {
+
+ TokenizerModel model = train(new TokenizerFactory(lang, null, true,
+ Pattern.compile(pattern)));
+
+ TokenizerME tokenizer = new TokenizerME(model);
+ String[] tokens = tokenizer.tokenize(sentence);
+
+ Assertions.assertEquals(expectedNumTokens, tokens.length);
+ String[] sentSplit = sentence.replaceAll("\\.", " .").split(" ");
+ for (int i = 0; i < sentSplit.length; i++) {
+ Assertions.assertEquals(sentSplit[i], tokens[i]);
+ }
+ }
+
+ @Test
+ void testCustomPatternForTokenizerMEDeu() throws IOException {
+ String lang = "deu";
+ String pattern = "^[A-Za-z0-9äéöüÄÉÖÜß]+$";
+ String sentence = "Ich wähle den auf S. 183 ff. mitgeteilten Traum von der
botanischen Monographie.";
+ checkCustomPatternForTokenizerME(lang, pattern, sentence, 16);
Review Comment:
Generally speaking: the lack of abbreviation support affects all languages,
not only German.
> Write better tests for pattern verification (tokenizers)
> --------------------------------------------------------
>
> Key: OPENNLP-1479
> URL: https://issues.apache.org/jira/browse/OPENNLP-1479
> Project: OpenNLP
> Issue Type: Improvement
> Components: Tokenizer
> Affects Versions: 2.1.1
> Reporter: Bruno P. Kinoshita
> Assignee: Lara Marinov
> Priority: Major
> Fix For: 2.3.2
>
>
> From [https://github.com/apache/opennlp/pull/516#issuecomment-1455015772]
> At the moment our tests verify that the tokenizer objects are created
> correctly (i.e. tests getters and setters, constructor, etc.), without
> verifying the actual behavior when used in conjunction with other classes
> (factory, tokenizer, trainers, etc).
> It would be best to test the patterns used in the factories for different
> languages with some interesting sample data (maybe something from project
> gutenberg, open source news sites, etc.).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)