[
https://issues.apache.org/jira/browse/OPENNLP-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696553#comment-17696553
]
ASF GitHub Bot commented on OPENNLP-1474:
-----------------------------------------
kinow commented on code in PR #516:
URL: https://github.com/apache/opennlp/pull/516#discussion_r1125635587
##########
opennlp-tools/src/main/java/opennlp/tools/tokenize/lang/Factory.java:
##########
@@ -30,10 +30,27 @@ public class Factory {
   private static final Pattern PORTUGUESE = Pattern.compile("^[0-9a-záãâàéêíóõôúüçA-ZÁÃÂÀÉÊÍÓÕÔÚÜÇ]+$");
   private static final Pattern FRENCH = Pattern.compile("^[a-zA-Z0-9àâäèéêëîïôœùûüÿçÀÂÄÈÉÊËÎÏÔŒÙÛÜŸÇ]+$");
-  // For reference: https://www.sttmedia.com/characterfrequency-dutch
+  // From: https://www.sttmedia.com/characterfrequency-dutch
   private static final Pattern DUTCH = Pattern.compile("^[A-Za-z0-9äöüëèéïijÄÖÜËÉÈÏIJ]+$");
-  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äöüÄÖÜß]+$");
+  // Note: The extra é and É are included to cover German "Lehnwörter" such as "Café"
+  private static final Pattern GERMAN = Pattern.compile("^[A-Za-z0-9äéöüÄÉÖÜß]+$");
+
+  // From: https://en.wikipedia.org/wiki/Polish_alphabet
+  //       https://pl.wikipedia.org/wiki/Alfabet_polski
+  private static final Pattern POLISH = Pattern.compile("^[A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ]+$");
Review Comment:
Just my OCD here, @mawiesne, but could we keep the same order for lower
case and upper case? :grimacing:

s/A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ/A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ/

(I was reading the upper case as "alphanum and Z Z Caselon", and thought it
was an easy way to memorize it, so I went with that order for the lower-case
chars too, but we can change it if another order makes more sense.)
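
For illustration only, here is a minimal sketch (not part of the PR; the class
and method names are hypothetical) of why the reordering should be purely
cosmetic. A regex character class behaves as a set, so both orderings should
accept the same tokens. The only difference is that the suggested lower-case
run is exactly the upper-case run, lower-cased.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: compares the two orderings discussed above.
class PolishPatternOrder {

  // Character class as it currently appears in the diff.
  static final String CURRENT_CLASS = "A-Za-z0-9żźćńółęąśŻŹĆĄŚĘŁÓŃ";
  // Character class suggested in this comment: lower case mirrors upper case.
  static final String SUGGESTED_CLASS = "A-Za-z0-9żźćąśęłóńŻŹĆĄŚĘŁÓŃ";

  static final Pattern CURRENT = Pattern.compile("^[" + CURRENT_CLASS + "]+$");
  static final Pattern SUGGESTED = Pattern.compile("^[" + SUGGESTED_CLASS + "]+$");

  // True if both patterns give the same verdict for every sample token.
  static boolean sameVerdicts(List<String> tokens) {
    for (String token : tokens) {
      if (CURRENT.matcher(token).matches() != SUGGESTED.matcher(token).matches()) {
        return false;
      }
    }
    return true;
  }

  // True if the lower-case run equals the upper-case run, lower-cased, in the same order.
  static boolean caseOrderMatches(String lower, String upper) {
    return upper.toLowerCase(Locale.ROOT).equals(lower);
  }

  public static void main(String[] args) {
    List<String> samples = List.of("żółć", "ŁÓDŹ", "Gdańsk", "Café", "abc123");
    System.out.println(sameVerdicts(samples));
    System.out.println(caseOrderMatches("żźćąśęłóń", "ŻŹĆĄŚĘŁÓŃ"));
    System.out.println(caseOrderMatches("żźćńółęąś", "ŻŹĆĄŚĘŁÓŃ"));
  }
}
```

The first check compares matching behavior on a handful of sample tokens, and
the last two compare only the ordering of the diacritics: the suggested
lower-case run mirrors the upper-case one, the current run does not.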
> Create tokenizer factories for other langs (Spanish, Italian, ...)
> ------------------------------------------------------------------
>
> Key: OPENNLP-1474
> URL: https://issues.apache.org/jira/browse/OPENNLP-1474
> Project: OpenNLP
> Issue Type: Improvement
> Components: Tokenizer
> Affects Versions: 2.1.1
> Reporter: Bruno P. Kinoshita
> Assignee: Martin Wiesner
> Priority: Major
> Fix For: 2.1.2
>
>
> From [https://github.com/apache/opennlp/pull/506#issuecomment-1445849746]
> We can create more factories for languages such as Spanish and Italian. For
> example:
> {noformat}
> // From: https://it.wikipedia.org/wiki/Alfabeto_italiano
> private static final Pattern ITALIAN = Pattern.compile("^[0-9a-zàèéìîíòóùüA-ZÀÈÉÌÎÍÒÓÙÜ]+$");
> // From: https://en.wikiversity.org/wiki/Alphabet/Spanish_alphabet &
> //       https://en.wikipedia.org/wiki/Spanish_orthography#Alphabet_in_Spanish &
> //       https://www.fundeu.es/consulta/tilde-en-la-y-y-griega-o-ye-24786/
> private static final Pattern SPANISH = Pattern.compile("^[0-9a-záéíóúüýñA-ZÁÉÍÓÚÝÑ]+$");
> {noformat}
> Community feedback would be appreciated.