[jira] [Reopened] (OPENNLP-1197) FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words

Koji Sekiguchi (JIRA) Tue, 26 Jun 2018 18:43:57 -0700


     [ https://issues.apache.org/jira/browse/OPENNLP-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Koji Sekiguchi reopened OPENNLP-1197:
-------------------------------------

After applying this patch, the Eval tests, which are not run by mvn test, no 
longer pass. I am reopening this issue to investigate.

> FeatureGeneratorUtil.tokenFeature() always returns "lc" for Japanese words
> --------------------------------------------------------------------------
>
>                 Key: OPENNLP-1197
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1197
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Machine Learning
>    Affects Versions: 1.8.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Major
>
> FeatureGeneratorUtil.tokenFeature() always classifies Japanese words as "lc" 
> (lower case). That looks like a bug to me because they are not lower-case 
> letters. More generally, FeatureGeneratorUtil.tokenFeature() seems to handle 
> only European and American languages.
> For example, in Japanese NER, typical token classes are as follows:
> - DIGIT
> - HIRA : あ, い, う, え, お etc.
> - KATA : ア, イ, ウ, エ, オ etc.
> - ALPHA : we don't need to distinguish lower/upper case
> - OTHER
> I think we could extend FeatureGeneratorUtil.tokenFeature() with the 
> additional token classes mentioned above, but later on someone working with 
> another Asian language may make a similar request.
> I'd like to make FeatureGeneratorUtil pluggable, but I don't have a concrete 
> idea yet.
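
To make the token classes listed in the description concrete, here is a minimal 
sketch of how such a classification could look. It is not the patch attached to 
this issue and not part of OpenNLP: the class name JapaneseTokenClassSketch and 
the methods tokenClass() and charClass() are hypothetical, and the rule that a 
token gets a class only when all of its characters share that class is an 
assumption made for illustration. It relies only on java.lang.Character.

```java
// Hypothetical sketch, not OpenNLP code: maps a token to one of the classes
// DIGIT, HIRA, KATA, ALPHA, OTHER mentioned in the issue description.
final class JapaneseTokenClassSketch {

    private JapaneseTokenClassSketch() {
    }

    // Assumption: a token belongs to a class only if every code point in it
    // belongs to that class; mixed, empty, or null tokens become OTHER.
    public static String tokenClass(String token) {
        if (token == null || token.isEmpty()) {
            return "OTHER";
        }
        String result = null;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            String current = charClass(cp);
            if (result == null) {
                result = current;
            } else if (!result.equals(current)) {
                return "OTHER";
            }
        }
        return result;
    }

    // Classifies a single Unicode code point.
    public static String charClass(int cp) {
        if (Character.isDigit(cp)) {
            // Covers ASCII digits and full-width digits (general category Nd).
            return "DIGIT";
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        if (block == Character.UnicodeBlock.HIRAGANA) {
            return "HIRA";
        }
        if (block == Character.UnicodeBlock.KATAKANA) {
            // The Katakana block includes the prolonged sound mark U+30FC.
            // Half-width katakana live in another block and are not handled.
            return "KATA";
        }
        if (Character.isLetter(cp)
                && Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN) {
            // Lower and upper case are deliberately not distinguished.
            return "ALPHA";
        }
        return "OTHER";
    }
}
```

With this sketch, tokenClass("ひらがな") returns "HIRA", tokenClass("カタカナ") 
returns "KATA", tokenClass("2018") returns "DIGIT", tokenClass("OpenNLP") 
returns "ALPHA", and tokenClass("東京") returns "OTHER" because kanji fall into 
none of the first four classes.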



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
