[ https://issues.apache.org/jira/browse/OPENNLP-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16629651#comment-16629651 ]
ASF GitHub Bot commented on OPENNLP-1221:
-----------------------------------------

GitHub user kojisekig opened a pull request:

    https://github.com/apache/opennlp-addons/pull/3

    OPENNLP-1221: FeatureGeneratorUtil.tokenFeature() is too specific for…

    … some languages

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kojisekig/opennlp-addons OPENNLP-1221

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/opennlp-addons/pull/3.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3

----
commit 066b5a1f1c2ba972c7fdd025cb4d4689a0e04e97
Author: koji <koji@...>
Date:   2018-09-27T01:56:13Z

    OPENNLP-1221: FeatureGeneratorUtil.tokenFeature() is too specific for some languages

----

> FeatureGeneratorUtil.tokenFeature() is too specific for some languages
> ----------------------------------------------------------------------
>
>                 Key: OPENNLP-1221
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1221
>             Project: OpenNLP
>          Issue Type: Improvement
>    Affects Versions: 1.9.0
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>
> As I described in OPENNLP-1197, for Japanese NER we usually use only
> DIGIT, HIRA (あ, い, う, え, お etc.), KATA (ア, イ, ウ, エ, オ etc.), ALPHA and OTHER
> as token classes. What FeatureGeneratorUtil.tokenFeature() provides at
> present is too specific. For example, I don't need to distinguish among
> lc (all lowercase letters), ac (all capital letters) and ic (initial
> capital letter).
>
> As a trial, when I applied the following patch to avoid "too specific
> token class generation":
> {code}
> diff --git a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> index e6b8af95..405938d1 100644
> --- a/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> +++ b/opennlp-tools/src/main/java/opennlp/tools/util/featuregen/FeatureGeneratorUtil.java
> @@ -29,6 +29,8 @@ public class FeatureGeneratorUtil {
>    private static final String TOKEN_AND_CLASS_PREFIX = "w&c";
>
>    private static final Pattern capPeriod = Pattern.compile("^[A-Z]\\.$");
> +  private static final Pattern pDigit = Pattern.compile("^\\p{IsDigit}+$");
> +  private static final Pattern pAlpha = Pattern.compile("^\\p{IsAlphabetic}+$");
>
>    /**
>     * Generates a class name for the specified token.
> @@ -64,48 +66,11 @@ public class FeatureGeneratorUtil {
>      else if (pattern.isAllKatakana()) {
>        feat = "jak";
>      }
> -    else if (pattern.isAllLowerCaseLetter()) {
> -      feat = "lc";
> +    else if (pDigit.matcher(token).find()) {
> +      feat = "digit";
>      }
> -    else if (pattern.digits() == 2) {
> -      feat = "2d";
> -    }
> -    else if (pattern.digits() == 4) {
> -      feat = "4d";
> -    }
> -    else if (pattern.containsDigit()) {
> -      if (pattern.containsLetters()) {
> -        feat = "an";
> -      }
> -      else if (pattern.containsHyphen()) {
> -        feat = "dd";
> -      }
> -      else if (pattern.containsSlash()) {
> -        feat = "ds";
> -      }
> -      else if (pattern.containsComma()) {
> -        feat = "dc";
> -      }
> -      else if (pattern.containsPeriod()) {
> -        feat = "dp";
> -      }
> -      else {
> -        feat = "num";
> -      }
> -    }
> -    else if (pattern.isAllCapitalLetter()) {
> -      if (token.length() == 1) {
> -        feat = "sc";
> -      }
> -      else {
> -        feat = "ac";
> -      }
> -    }
> -    else if (capPeriod.matcher(token).find()) {
> -      feat = "cp";
> -    }
> -    else if (pattern.isInitialCapitalLetter()) {
> -      feat = "ic";
> +    else if (pAlpha.matcher(token).find()) {
> +      feat = "alpha";
>      }
>      else {
>        feat = "other";
> {code}
> total F1 increased from 82.00% to 82.13%. The gain may be trivial, but I
> think there is still a lot of room to tune and improve the performance.
> Fortunately, since I was able to add the japanese-addon project to
> opennlp-addons in the previous ticket, I'd like to add some programs to
> japanese-addon that generate simpler token classes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
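
For illustration only, the five coarse token classes named in the issue (DIGIT, HIRA, KATA, ALPHA, OTHER) could be produced by a small standalone helper along the following lines. This is a minimal sketch, not the code from the pull request: the class name `SimpleTokenClass`, the method `tokenClass`, and the returned feature strings are all hypothetical, and the use of the Unicode blocks `InHiragana` and `InKatakana` is an assumption about how the Japanese classes might be detected.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of OpenNLP): maps a token to one of five coarse classes. */
final class SimpleTokenClass {

  // Same Unicode properties as the trial patch in the issue.
  private static final Pattern DIGIT = Pattern.compile("\\p{IsDigit}+");
  private static final Pattern ALPHA = Pattern.compile("\\p{IsAlphabetic}+");

  // Assumption: Unicode blocks are used for the Japanese syllabaries.
  private static final Pattern HIRA = Pattern.compile("\\p{InHiragana}+");
  private static final Pattern KATA = Pattern.compile("\\p{InKatakana}+");

  private SimpleTokenClass() {
  }

  /** Returns "digit", "hira", "kata", "alpha" or "other" (illustrative names). */
  static String tokenClass(String token) {
    if (DIGIT.matcher(token).matches()) {
      return "digit";
    }
    // Kana must be tested before ALPHA because kana letters are also Alphabetic.
    if (HIRA.matcher(token).matches()) {
      return "hira";
    }
    if (KATA.matcher(token).matches()) {
      return "kata";
    }
    if (ALPHA.matcher(token).matches()) {
      return "alpha";
    }
    // Mixed, punctuation-only, or empty tokens fall through to here.
    return "other";
  }
}
```

Because `matches()` requires the whole token to match, no `^`/`$` anchors are needed, and the order of the checks matters: testing for alphabetic characters first would absorb all-hiragana and all-katakana tokens into the "alpha" class.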