[ 
https://issues.apache.org/jira/browse/OPENNLP-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693656#comment-17693656
 ] 

ASF GitHub Bot commented on OPENNLP-141:
----------------------------------------

kinow commented on code in PR #506:
URL: https://github.com/apache/opennlp/pull/506#discussion_r1118098477


##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
 
 public class Factory {
 
-  public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+  public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");
+
+  private static final Pattern PORTOGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+  private static final Pattern FRENCH = Pattern.compile("^[a-zA-Z0-9àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ]+$");
+
+  // For reference: https://www.sttmedia.com/characterfrequency-dutch
+  private static final Pattern DUTCH = Pattern.compile("^[A-Za-z0-9äöüëèéïijÄÖÜËÉÈÏIJ]+$");
+  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");
 
   /**
-   * Gets the alphanumeric pattern for the language. Please save the value
-   * locally because this call is expensive.
+   * Gets the alphanumeric pattern for a language.
    *
-   * @param languageCode The language code. If {@code null}, or unknown,
-   *                     the default pattern will be returned.
-   * @return The alphanumeric pattern for the language or the default pattern.
+   * @param languageCode The ISO_639-1 code. If {@code null}, or unknown, the
+   *                     {@link #DEFAULT_ALPHANUMERIC} pattern will be returned.
+   * @return The alphanumeric {@link Pattern} for the language, or the default pattern.
    */
   public Pattern getAlphanumeric(String languageCode) {
+    // For reference: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
     if ("pt".equals(languageCode) || "por".equals(languageCode)) {
-      return Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
+      return PORTOGUESE;

Review Comment:
   PORTUGUESE



##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -25,24 +25,45 @@
 
 public class Factory {
 
-  public static final String DEFAULT_ALPHANUMERIC = "^[A-Za-z0-9]+$";
+  public static final Pattern DEFAULT_ALPHANUMERIC = Pattern.compile("^[A-Za-z0-9]+$");
+
+  private static final Pattern PORTOGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");

Review Comment:
   s/PORTOGUESE/PORTUGUESE





> Tokenizers alpha numeric optimization only recognizes a-z as alpha chars
> ------------------------------------------------------------------------
>
>                 Key: OPENNLP-141
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-141
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Tokenizer
>    Affects Versions: tools-1.5.0-sourceforge
>            Reporter: Jörn Kottmann
>            Assignee: Martin Wiesner
>            Priority: Minor
>
> The Tokenizer has an optimization that skips tokens consisting only of
> digits or alphabetic characters. In languages other than English, the
> alphabetic characters include umlauts and other letters outside the a-z range.
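
The behaviour described in the issue can be reproduced with a minimal sketch like the one below. It assumes the two expressions from the diff above (the ASCII-only default and the German pattern); the class name AlphanumericDemo and the helper isAlphanumeric are hypothetical and not part of OpenNLP. The non-ASCII letters are written as Unicode escapes only so that the file compiles regardless of source encoding.

```java
import java.util.regex.Pattern;

public class AlphanumericDemo {

  // Same expression as DEFAULT_ALPHANUMERIC in the diff: ASCII letters and digits only.
  static final Pattern ASCII_ONLY = Pattern.compile("^[A-Za-z0-9]+$");

  // Same expression as GERMAN in the diff; the escapes spell ä ö ü Ä Ö Ü ß.
  static final Pattern GERMAN =
      Pattern.compile("^[A-Za-z0-9\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]+$");

  // Hypothetical helper: true if the whole token matches the given pattern.
  static boolean isAlphanumeric(Pattern pattern, String token) {
    return pattern.matcher(token).matches();
  }

  public static void main(String[] args) {
    String token = "Gr\u00f6\u00dfe"; // the German word "Größe"

    // The ASCII-only pattern rejects the token because of ö and ß.
    System.out.println(isAlphanumeric(ASCII_ONLY, token)); // prints "false"

    // The language-specific pattern accepts it.
    System.out.println(isAlphanumeric(GERMAN, token));     // prints "true"
  }
}
```

Because both patterns are compiled once as static constants, repeated checks reuse the same Pattern instances instead of recompiling the expression on every call.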



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
