[ 
https://issues.apache.org/jira/browse/LUCENE-7273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-7273:
----------------------------------
    Attachment: LUCENE-analysis-kuromoji-poskeep.patch

> New kuromoji TokenFilter to keep tokens by part-of-speech tags
> --------------------------------------------------------------
>
>                 Key: LUCENE-7273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7273
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Minor
>         Attachments: LUCENE-analysis-kuromoji-poskeep.patch
>
>
> Kuromoji has JapanesePartOfSpeechStopFilter to drop tokens by their 
> part-of-speech tags. In some cases, it would be more convenient to keep only 
> the tokens whose POS tags appear in a "keep" list.
> Example usage:
> {code:java}
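> import java.util.Arrays;
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.lucene.analysis.ja.JapaneseTokenizer;
>
> // keeps proper nouns - location names only
> Set<String> keeptags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));
> // no user dictionary, punctuation is not discarded, SEARCH mode
> JapaneseTokenizer tokenizer =
>     new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
> // JapanesePartOfSpeechKeepFilter is the new filter proposed in this issue
> JapanesePartOfSpeechKeepFilter stream =
>     new JapanesePartOfSpeechKeepFilter(tokenizer, keeptags);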
> {code}
> {code:xml}
> <!-- (Solr) analyzer definition -->
> <fieldType name="text_ja_propernoun" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="false">
>     <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <filter class="solr.JapanesePartOfSpeechKeepFilterFactory"
>                 tags="lang/keeptags_ja.txt"/>
>         <filter class="solr.JapaneseKatakanaStemFilterFactory"
>                 minimumLength="4"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
> </fieldType>
> {code}
> Of course, the same result can be achieved with JapanesePartOfSpeechStopFilter. 
> However, because there are about 70 part-of-speech tags, it is cumbersome to 
> list every other tag as a stop tag just to keep tokens with a few POS tags 
> of interest.
> I'll add a patch soon.
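
To make the stop-list workaround described above concrete, here is a minimal plain-Java sketch. It does not use Lucene and is not taken from the attached patch. The class name PosTagSets, its methods, the four-entry ALL_TAGS set (a stand-in for the roughly 70 real tags), and the {surface, tag} pair representation are all assumptions made for illustration. It shows that a "keep" list is equivalent to a "stop" list containing every other tag, which is why the stop list grows long when only a few tags are of interest.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Plain-Java illustration only; no Lucene classes are involved. */
final class PosTagSets {

    /** Hypothetical, heavily abbreviated stand-in for the full tag inventory. */
    static final Set<String> ALL_TAGS = new LinkedHashSet<>(Arrays.asList(
            "名詞-一般",
            "名詞-固有名詞-地域-一般",
            "動詞-自立",
            "助詞-格助詞-一般"));

    private PosTagSets() {}

    /** Stop list needed to emulate "keep only keepTags": every tag not in keepTags. */
    static Set<String> stopTagsFor(Set<String> allTags, Set<String> keepTags) {
        Set<String> stopTags = new LinkedHashSet<>(allTags);
        stopTags.removeAll(keepTags);
        return stopTags;
    }

    /** Keep semantics: a {surface, tag} pair passes only if its tag is listed. */
    static List<String> keepByTag(List<String[]> tokens, Set<String> keepTags) {
        return tokens.stream()
                .filter(token -> keepTags.contains(token[1]))
                .map(token -> token[0])
                .collect(Collectors.toList());
    }

    /** Stop semantics: a {surface, tag} pair passes unless its tag is listed. */
    static List<String> dropByTag(List<String[]> tokens, Set<String> stopTags) {
        return tokens.stream()
                .filter(token -> !stopTags.contains(token[1]))
                .map(token -> token[0])
                .collect(Collectors.toList());
    }
}
```

With a single keep tag, stopTagsFor returns every other entry of the inventory, and keepByTag with the keep set yields the same tokens as dropByTag with that complement. The tags attached to tokens in such a test are assigned by hand, not produced by the tokenizer.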



