[jira] [Created] (LUCENE-7273) New kuromoji TokenFilter to keep tokens by part-of-speech tags

Tomoko Uchida (JIRA) Thu, 05 May 2016 06:26:43 -0700

Tomoko Uchida created LUCENE-7273:
-------------------------------------

             Summary: New kuromoji TokenFilter to keep tokens by part-of-speech tags
                 Key: LUCENE-7273
                 URL: https://issues.apache.org/jira/browse/LUCENE-7273
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Tomoko Uchida
            Priority: Minor



Kuromoji has JapanesePartOfSpeechStopFilter, which drops tokens by their 
part-of-speech (POS) tags. In some cases it would be more convenient to keep 
only the tokens whose POS tags appear in a "keep" list.

Example usage:
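{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;

// Keep only proper nouns that are location names.
Set<String> keepTags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));
// Arguments: no user dictionary, do not discard punctuation, search mode.
JapaneseTokenizer tokenizer =
    new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
// JapanesePartOfSpeechKeepFilter is the new filter proposed in this issue.
JapanesePartOfSpeechKeepFilter stream =
    new JapanesePartOfSpeechKeepFilter(tokenizer, keepTags);
{code}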

{code:xml}
<!-- (Solr) analyzer definition -->
<fieldType name="text_ja_propernoun" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
    <analyzer>
        <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.JapanesePartOfSpeechKeepFilterFactory" tags="lang/keeptags_ja.txt"/>
        <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
{code}
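
For illustration, the keeptags_ja.txt file referenced above might look like the 
sketch below. The file name comes from the analyzer definition; its contents 
and format are an assumption, modeled on the stoptags_ja.txt file used by 
JapanesePartOfSpeechStopFilterFactory (one POS tag per line, '#' for comments).

```text
# keeptags_ja.txt (hypothetical contents): POS tags of tokens to keep
# proper noun - place name - general
名詞-固有名詞-地域-一般
# proper noun - place name - country
名詞-固有名詞-地域-国
```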

Of course, the same result can be achieved with JapanesePartOfSpeechStopFilter. 
However, because there are about 70 part-of-speech tags, listing every stop 
tag just to keep tokens with a few POS tags of interest is cumbersome.
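
The relationship between the two approaches can be sketched in plain Java, 
without Lucene. The class PosFilterSketch, its Token type, and the methods 
keepFilter and stopFilter are hypothetical names for illustration only, and 
the four-tag inventory and hand-assigned tags stand in for the real tag set 
and tokenizer output. The point is that a keep list K gives the same result 
as a stop list containing every tag except those in K.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PosFilterSketch {

    /** Hypothetical stand-in for a token: surface form plus POS tag. */
    static final class Token {
        final String surface;
        final String pos;

        Token(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    /** Stop semantics: drop every token whose tag is in stopTags. */
    static List<String> stopFilter(List<Token> tokens, Set<String> stopTags) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    /** Keep semantics: retain only tokens whose tag is in keepTags. */
    static List<String> keepFilter(List<Token> tokens, Set<String> keepTags) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (keepTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny illustrative inventory; the real one has about 70 tags.
        Set<String> allTags = new HashSet<>(Arrays.asList(
            "名詞-固有名詞-地域-一般", "名詞-一般", "助詞-格助詞-一般", "動詞-自立"));

        // Hand-tagged tokens, not actual tokenizer output.
        List<Token> tokens = Arrays.asList(
            new Token("京都", "名詞-固有名詞-地域-一般"),
            new Token("に", "助詞-格助詞-一般"),
            new Token("行く", "動詞-自立"));

        // Keep approach: one tag of interest.
        Set<String> keepTags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));

        // Stop approach: every other tag must be listed explicitly.
        Set<String> stopTags = new HashSet<>(allTags);
        stopTags.removeAll(keepTags);

        // Both calls print the same list of surface forms.
        System.out.println(keepFilter(tokens, keepTags));
        System.out.println(stopFilter(tokens, stopTags));
    }
}
```

With a keep filter the configuration size grows with the number of tags of 
interest; with a stop filter it grows with the number of all remaining tags.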

I'll add a patch soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
