Tomoko Uchida created LUCENE-7273:
-------------------------------------

             Summary: New kuromoji TokenFilter to keep tokens by part-of-speech tags
                 Key: LUCENE-7273
                 URL: https://issues.apache.org/jira/browse/LUCENE-7273
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Tomoko Uchida
            Priority: Minor

Kuromoji has JapanesePartOfSpeechStopFilter to drop tokens by their part-of-speech (POS) tags. In some cases it would be more convenient to keep tokens according to a list of POS tags to keep.

Example usage:

{code:java}
// Keeps proper nouns - location names only
String[] tags = new String[]{"名詞-固有名詞-地域-一般"};
Set<String> keeptags = new HashSet<>();
for (String tag : tags) {
  keeptags.add(tag);
}
JapaneseTokenizer tokenizer = new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH);
JapanesePartOfSpeechKeepFilter stream = new JapanesePartOfSpeechKeepFilter(tokenizer, keeptags);
{code}

{code:xml}
<!-- (Solr) analyzer definition -->
<fieldType name="text_ja_propernoun" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.JapaneseTokenizerFactory" mode="normal"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.JapanesePartOfSpeechKeepFilterFactory" tags="lang/keeptags_ja.txt"/>
    <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

Of course, the same result can be achieved with JapanesePartOfSpeechStopFilter. However, because there are about 70 part-of-speech tags, it is cumbersome to list every other tag as a stop tag just to keep tokens with the few POS tags of interest.

I'll add a patch soon.
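
To illustrate the trade-off between a keep list and a stop list, here is a minimal plain-Java sketch. It does not use any Lucene classes and is not the proposed patch: `PosKeepSketch`, `Token`, `keepByTags`, `dropByTags` and `complement` are hypothetical names made up for this illustration, and the sample tokens and the four-tag inventory are hand-written data, not tokenizer output. It only shows that a stop list emulates a keep list when it contains every known tag except the kept ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; no Lucene classes are involved.
class PosKeepSketch {

    // Hypothetical stand-in for a token that carries its part-of-speech tag.
    static final class Token {
        final String surface;
        final String pos;

        Token(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    // "Keep" semantics: retain only tokens whose tag is in keepTags.
    static List<String> keepByTags(List<Token> tokens, Set<String> keepTags) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (keepTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    // "Stop" semantics: drop tokens whose tag is in stopTags.
    static List<String> dropByTags(List<Token> tokens, Set<String> stopTags) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (!stopTags.contains(t.pos)) {
                out.add(t.surface);
            }
        }
        return out;
    }

    // Stop list needed to emulate a keep list: every known tag except the kept ones.
    static Set<String> complement(Set<String> allTags, Set<String> keepTags) {
        Set<String> stop = new HashSet<>(allTags);
        stop.removeAll(keepTags);
        return stop;
    }

    public static void main(String[] args) {
        // Hand-written sample data (assumed tags, for illustration only).
        List<Token> tokens = Arrays.asList(
            new Token("東京", "名詞-固有名詞-地域-一般"),
            new Token("に", "助詞-格助詞-一般"),
            new Token("行く", "動詞-自立"));
        Set<String> allTags = new HashSet<>(Arrays.asList(
            "名詞-固有名詞-地域-一般", "名詞-一般", "助詞-格助詞-一般", "動詞-自立"));
        Set<String> keepTags = new HashSet<>(Arrays.asList("名詞-固有名詞-地域-一般"));

        // One keep tag versus three stop tags for the same result.
        System.out.println(keepByTags(tokens, keepTags));
        System.out.println(dropByTags(tokens, complement(allTags, keepTags)));
    }
}
```

Both calls in `main` select the same tokens, but the stop list has to name every other tag in the inventory. With the real inventory of about 70 tags that list gets long, which is the motivation for a dedicated keep filter.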