[ 
https://issues.apache.org/jira/browse/OPENNLP-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi resolved OPENNLP-1201.
-------------------------------------
    Resolution: Fixed
      Assignee: Koji Sekiguchi

This feature has been added to opennlp-addons. Thanks!

> add bailout way for certain languages in order to use POS features
> ------------------------------------------------------------------
>
>                 Key: OPENNLP-1201
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1201
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Command Line Interface, Formats
>    Affects Versions: 1.8.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Major
>
> OpenNLP tools require that the text being processed is tokenized in advance
> (in other words, that words are separated from each other by spaces), so it
> is difficult for users of certain languages (e.g. CJK) to use POS
> (Part-of-Speech) features.
> To simplify the explanation, consider using NameFinder for Japanese text. The
> NameFinder tools (Train, Eval, Recognize) require users to provide Japanese
> text that has already been tokenized, but once Japanese text is tokenized, it
> loses its POS information. (I think Chinese has the same problem.)
> Let me describe this problem for users of Western languages :) (English,
> French, Italian, etc.) without using Japanese characters. I’ll use the
> English alphabet instead.
> Suppose you have the sentence “isentthemachine” that you want to give to
> NameFinder, and you use a morphological analyzer to tokenize it.
> There are two possible token sequences:
> - i (PPSS) / sent (VBD) / the (AT) / machine (NP)
> - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)
> As you can see, a morphological analyzer not only tokenizes the sentence but
> also assigns a POS tag to each token. The same thing happens in Japanese
> (and in Chinese, I think).
> However, the OpenNLP feature generator API accepts only a sequence of tokens,
> i.e. `String[] tokens`, so I cannot produce POS features in the feature
> generator.
> To solve this problem (and to invite many users to our community), I’d like 
> to suggest that OpenNLP tools allow users to add optional information to each 
> tokenized word.
> For example, one can give the following text when using NameFinder tools.
> {code}
> $ cat en-ner.train
> I/PPSS sent/VBD the/AT machine/NP
> {code}
> When using such text, users must tell the tool that each token carries a POS
> tag by passing an option, e.g. -postag:
> {code}
> $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
> {code}
> We can maintain backward compatibility by setting -postag to false by
> default; in that case, existing feature generators work exactly as before.
> If a user sets the -postag option on the command line, the existing feature
> generators strip the “/POS” part of each “word/POS” token so that they
> produce the same features as before.
> In addition to handling the -postag option, I’d like to add a simple feature
> generator that emits only the “POS” part of each “word/POS” token. This
> simple feature generator allows Japanese/Chinese users to produce precise POS
> features.
> I’d like to focus on NameFinder in this ticket (I can add this option to
> other tools (chunker, classifier, etc.) in another ticket if needed).
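
For readers following the ticket, here is a minimal sketch of the “word/POS” handling it describes: stripping the “/POS” part for existing feature generators, and emitting a POS-only feature for the new one. It is not the code that was committed to opennlp-addons. The class name PosTaggedTokens, its method names, and the "pos=" feature prefix are all hypothetical, and splitting on the last '/' is an assumption about how the token format would be parsed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: handles tokens written as "word/POS".
final class PosTaggedTokens {

    private PosTaggedTokens() {}

    // Index of the separator, or -1 when the token has no usable "/POS" suffix.
    // Assumption: the tag follows the LAST '/', so "1/2/CD" is word "1/2", tag "CD".
    private static int separator(String token) {
        int i = token.lastIndexOf('/');
        return (i > 0 && i < token.length() - 1) ? i : -1;
    }

    // Surface form of the token; untagged tokens are returned unchanged.
    static String word(String token) {
        int i = separator(token);
        return i < 0 ? token : token.substring(0, i);
    }

    // POS tag of the token, or null when the token carries none.
    static String pos(String token) {
        int i = separator(token);
        return i < 0 ? null : token.substring(i + 1);
    }

    // Strips "/POS" from every token so that existing feature generators
    // see plain words and produce the same features as before.
    static String[] words(String[] tokens) {
        String[] result = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = word(tokens[i]);
        }
        return result;
    }

    // Emits only the POS part of tokens[index] as a feature string.
    // The "pos=" prefix is an illustrative choice, not an OpenNLP convention.
    static List<String> posFeatures(String[] tokens, int index) {
        List<String> features = new ArrayList<>();
        String tag = pos(tokens[index]);
        if (tag != null) {
            features.add("pos=" + tag);
        }
        return features;
    }
}
```

In an actual feature generator, logic like posFeatures would presumably sit inside an implementation of OpenNLP's AdaptiveFeatureGenerator, which receives the same `String[] tokens` and an index. Keeping the parsing in plain static methods makes the behaviour easy to check without the library on the classpath.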



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
