[ https://issues.apache.org/jira/browse/OPENNLP-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi resolved OPENNLP-1201.
-------------------------------------
    Resolution: Fixed
      Assignee: Koji Sekiguchi

This feature has been added to opennlp-addons. Thanks!

> add bailout way for certain languages in order to use POS features
> ------------------------------------------------------------------
>
>                 Key: OPENNLP-1201
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1201
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Command Line Interface, Formats
>    Affects Versions: 1.8.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Major
>
> Because OpenNLP tools require the text being processed to be tokenized in
> advance (in other words, the words in the text must be separated from each
> other by spaces), it is difficult for users of certain languages (e.g. CJK)
> to use POS (Part-of-Speech) features.
>
> To simplify the explanation, consider using NameFinder for Japanese text.
> The NameFinder tools (Train, Eval, Recognize) require users to provide
> Japanese text that has already been tokenized, but once we tokenize Japanese
> text, the POS information is lost. (I think Chinese has the same problem.)
>
> Let me describe this problem for users of Western languages :) (English,
> French, Italian, etc.) without using Japanese letters. I’ll use the English
> alphabet instead.
>
> Suppose you have the sentence text “isentthemachine” which you want to give
> to NameFinder, and you use a morphological analyzer to tokenize it. There
> are two possible sequences of tokens:
>
> - i (PPSS) / sent (VBD) / the (AT) / machine (NP)
> - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)
>
> As you can see, the morphological analyzer not only tokenizes the sentence
> but also assigns a POS tag to each token. The same thing happens in Japanese
> (and in Chinese, I think).
>
> However, the OpenNLP feature generator API accepts only a sequence of tokens,
> i.e. `String[] tokens`, so I cannot produce POS features in a feature
> generator.
>
> To solve this problem (and to invite many users to our community), I’d like
> to suggest that OpenNLP tools allow users to attach optional information to
> each tokenized word.
>
> For example, one can give the following text when using the NameFinder
> tools.
>
> {code}
> $ cat en-ner.train
> I/PPSS sent/VBD the/AT machine/NP
> {code}
>
> When using such text, users must tell the tool that each token carries a POS
> tag by passing a certain option, e.g. -postag
>
> {code}
> $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
> {code}
>
> We can maintain backward compatibility by making -postag false by default;
> in that case, existing feature generators work exactly the same as before.
> If a user sets the -postag option on the command line, the existing feature
> generators strip the “/POS” part of each “word/POS” token in the text so
> that they produce the same features as before.
>
> In addition to handling the -postag option, I’d like to add a simple feature
> generator which generates only the “POS” part of each “word/POS” token in
> the text. This simple feature generator allows Japanese/Chinese users to
> produce precise POS features.
>
> I’d like to focus on NameFinder in this ticket. (Let me add this option to
> other tools (chunker, classifier, etc.) in another ticket, if needed.)
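
The “word/POS” convention described in the issue can be illustrated with a small, self-contained Java sketch. It is not the code that was added to opennlp-addons and it does not use the OpenNLP API; the class name `PosTaggedTokens`, its method names, and the `pos=` feature prefix are all hypothetical. It also assumes that the tag follows the last “/” in a token, which the issue does not specify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only (not part of OpenNLP or opennlp-addons).
final class PosTaggedTokens {

    // Surface form of a "word/POS" token; a token without a tag is returned unchanged.
    // Assumption: the tag follows the LAST '/', so "1/2/CD" yields the word "1/2".
    static String word(String token) {
        int slash = token.lastIndexOf('/');
        return slash > 0 ? token.substring(0, slash) : token;
    }

    // POS part of a "word/POS" token, or the empty string if the token has no tag.
    static String pos(String token) {
        int slash = token.lastIndexOf('/');
        return slash > 0 && slash < token.length() - 1 ? token.substring(slash + 1) : "";
    }

    // What an existing feature generator would see once "/POS" has been stripped.
    static String[] words(String[] tokens) {
        String[] result = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            result[i] = word(tokens[i]);
        }
        return result;
    }

    // What a POS-only feature generator might emit: one "pos=<tag>" per tagged token.
    static List<String> posFeatures(String[] tokens) {
        List<String> features = new ArrayList<>();
        for (String token : tokens) {
            String tag = pos(token);
            if (!tag.isEmpty()) {
                features.add("pos=" + tag);
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = "I/PPSS sent/VBD the/AT machine/NP".split(" ");
        System.out.println(String.join(" ", words(tokens))); // I sent the machine
        System.out.println(posFeatures(tokens)); // [pos=PPSS, pos=VBD, pos=AT, pos=NP]
    }
}
```

Splitting on the last “/” keeps words that themselves contain a slash intact, at the cost of not supporting tags that contain one; a real implementation might choose a different separator or escaping rule.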