[
https://issues.apache.org/jira/browse/OPENNLP-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693656#comment-17693656
]
ASF GitHub Bot commented on OPENNLP-141:
----------------------------------------
kinow commented on code in PR #506:
URL: https://github.com/apache/opennlp/pull/506#discussion_r1118098477
##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
public class Factory {
- public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+ public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");
+
+ private static final Pattern PORTOGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+ private static final Pattern FRENCH = Pattern.compile("^[a-zA-Z0-9àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ]+$");
+
+ // For reference: https://www.sttmedia.com/characterfrequency-dutch
+ private static final Pattern DUTCH = Pattern.compile("^[A-Za-z0-9äöüëèéïijÄÖÜËÉÈÏIJ]+$");
+ private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");
/**
- * Gets the alphanumeric pattern for the language. Please save the value
- * locally because this call is expensive.
+ * Gets the alphanumeric pattern for a language.
*
- * @param languageCode The language code. If {@code null}, or unknown,
- * the default pattern will be returned.
- * @return The alphanumeric pattern for the language or the default pattern.
+ * @param languageCode The ISO_639-1 code. If {@code null}, or unknown, the
+ * {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
+ * @return The alphanumeric {@link Pattern} for the language, or the default pattern.
*/
public Pattern getAlphanumeric(String languageCode) {
+ // For reference: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
if ("pt".equals(languageCode) || "por".equals(languageCode)) {
- return Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+ return PORTOGUESE;
Review Comment:
Typo: this constant should be named PORTUGUESE.
##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
public class Factory {
- public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+ public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");
+
+ private static final Pattern PORTOGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
Review Comment:
s/PORTOGUESE/PORTUGUESE
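For context, the hunk under review moves from compiling a regex on every call to returning precompiled constants. A minimal sketch of that shape, with the constant spelled as the review suggests, might look like the following. The class name PatternLookupSketch and the static method are hypothetical (the method in the PR is an instance method); the character classes and the "pt"/"por" codes are copied from the diff.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the Factory class in the diff; not the actual OpenNLP code.
public class PatternLookupSketch {

    // Default ASCII-only pattern, as in the diff.
    public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");

    // Same character class as the Portuguese pattern in the diff, written with
    // Unicode escapes so the source stays ASCII-only regardless of file encoding.
    private static final Pattern PORTUGUESE = Pattern.compile(
        "^[0-9a-z\u00e1\u00e3\u00e2\u00e0\u00e9\u00ea\u00ed\u00f3\u00f5\u00f4\u00fa\u00fc\u00e7"
        + "A-Z\u00c1\u00c3\u00c2\u00c0\u00c9\u00ca\u00cd\u00d3\u00d5\u00d4\u00da\u00dc\u00c7]+$");

    // Returns a shared, precompiled Pattern; nothing is compiled per call.
    public static Pattern getAlphanumeric(String languageCode) {
        if ("pt".equals(languageCode) || "por".equals(languageCode)) {
            return PORTUGUESE;
        }
        // null and unknown codes fall back to the default pattern.
        return DEFAULT_ALPHANUMERIC;
    }

    public static void main(String[] args) {
        String token = "a\u00e7\u00e3o"; // the Portuguese word for "action"
        System.out.println(getAlphanumeric("pt").matcher(token).matches());   // true
        System.out.println(getAlphanumeric(null).matcher(token).matches());   // false
    }
}
```

Because Pattern instances are immutable and safe to share between threads, returning the same constant from every call should be fine; only the Matcher objects created from it are per-use.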
> Tokenizers alpha numeric optimization only recognizes a-z as alpha chars
> ------------------------------------------------------------------------
>
> Key: OPENNLP-141
> URL: https://issues.apache.org/jira/browse/OPENNLP-141
> Project: OpenNLP
> Issue Type: Bug
> Components: Tokenizer
> Affects Versions: tools-1.5.0-sourceforge
> Reporter: Jörn Kottmann
> Assignee: Martin Wiesner
> Priority: Minor
>
> The Tokenizer has an optimization that skips tokens consisting only of
> digits or alphabetic characters. In languages other than English, the
> alphabetic characters include umlauts and other letters that fall outside
> the a-z range.
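
To illustrate the reported behaviour, here is a minimal sketch of how an ASCII-only check misclassifies a German token. The class AlphanumericSkipDemo and the helper canSkip are hypothetical names used only for illustration; the two character classes correspond to the default and German patterns in the diff quoted earlier in this message.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it only mimics the "is this token purely alphanumeric?" test.
public class AlphanumericSkipDemo {

    // ASCII-only check, matching the original default pattern.
    static final Pattern ASCII = Pattern.compile("^[A-Za-z0-9]+$");

    // German character class from the diff, written with Unicode escapes
    // so the source stays ASCII-only regardless of file encoding.
    static final Pattern GERMAN =
        Pattern.compile("^[A-Za-z0-9\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]+$");

    // True when the whole token is alphanumeric under the given pattern,
    // i.e. when the optimization described in the issue would apply to it.
    static boolean canSkip(String token, Pattern alphanumeric) {
        return alphanumeric.matcher(token).matches();
    }

    public static void main(String[] args) {
        String greeting = "Gr\u00fc\u00dfe"; // German for "greetings"
        System.out.println(canSkip("Haus", ASCII));     // true
        System.out.println(canSkip(greeting, ASCII));   // false: umlaut and sharp s are outside a-z
        System.out.println(canSkip(greeting, GERMAN));  // true
    }
}
```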