[ 
https://issues.apache.org/jira/browse/LUCENE-3901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235746#comment-13235746
 ] 

Christian Moen commented on LUCENE-3901:
----------------------------------------

Find attached a patch for this.

The stemming is done by {{KuromojiKatakanaStemFilter}}, which has been added to 
{{KuromojiAnalyzer}}, and a corresponding {{KuromojiKatakanaStemFilterFactory}} 
has been added to the {{text_ja}} field type in {{schema.xml}}.

Note that this stemming is now turned on by default and I think it makes good 
sense to do so.  The minimum length of a token considered for stemming is 
configurable and I've made the default of 4 explicit in {{schema.xml}} to 
convey that it's there.
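
As a rough illustration of the rule (this is not the code in the patch), the 
check can be sketched in plain Java without any Lucene classes.  The class and 
method names below are hypothetical, and the sketch assumes that "minimum 
length" means a token must have at least that many characters before its 
trailing ー is removed.

```java
// Hypothetical stand-alone sketch of the stemming rule; not the patch itself.
final class KatakanaStemSketch {
    // U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK (the trailing "ー").
    static final char PROLONGED_SOUND_MARK = '\u30FC';

    // Assumed default, matching the value made explicit in schema.xml.
    static final int DEFAULT_MINIMUM_LENGTH = 4;

    // True if every character is in the full-width Katakana block (U+30A0..U+30FF).
    // Half-width katakana live in a different Unicode block and are rejected here.
    static boolean isFullWidthKatakana(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            if (Character.UnicodeBlock.of(token.charAt(i)) != Character.UnicodeBlock.KATAKANA) {
                return false;
            }
        }
        return true;
    }

    // Drops one trailing prolonged sound mark from sufficiently long katakana tokens.
    static String stem(String token, int minimumLength) {
        if (token.length() >= minimumLength
                && isFullWidthKatakana(token)
                && token.charAt(token.length() - 1) == PROLONGED_SOUND_MARK) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        String[] samples = {"マネージャー", "マネージャ", "コピー"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + stem(sample, DEFAULT_MINIMUM_LENGTH));
        }
    }
}
```

With a minimum length of 4, a three-character token such as コピー is left 
alone in this sketch, which is the kind of short word the length threshold is 
there to protect.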

The stemmer only supports full-width katakana and should be used in combination 
with a {{CJKWidthFilter}} if stemming half-width characters is required and 
you're doing your own wiring.  Both {{text_ja}} and {{KuromojiAnalyzer}} take 
care of this, and the default overall processing is the same in both.
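
To show why the width folding has to come first, here is a small plain-Java 
sketch.  It uses {{java.text.Normalizer}} with NFKC as a stand-in for 
{{CJKWidthFilter}}; NFKC is a broader normalization than the filter performs, 
so treat it as an approximation for illustration only.  The class and method 
names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical sketch: fold half-width katakana to full-width before stemming.
final class HalfWidthStemSketch {
    // U+30FC, the full-width prolonged sound mark the stemmer looks for.
    static final char PROLONGED_SOUND_MARK = '\u30FC';

    // Stand-in for CJKWidthFilter: NFKC maps half-width katakana (and the
    // half-width mark U+FF70) to their full-width forms, composing voiced marks.
    static String foldWidth(String token) {
        return Normalizer.normalize(token, Normalizer.Form.NFKC);
    }

    // Simplified stemming rule: only the full-width mark is recognized.
    static String stem(String token, int minimumLength) {
        if (token.length() >= minimumLength
                && token.charAt(token.length() - 1) == PROLONGED_SOUND_MARK) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        String halfWidth = "\uFF7B\uFF70\uFF8A\uFF9E\uFF70"; // half-width form of サーバー
        // Without folding, the trailing half-width mark is not recognized.
        System.out.println(stem(halfWidth, 4));
        // With folding first, the token becomes full-width and is stemmed.
        System.out.println(stem(foldWidth(halfWidth), 4));
    }
}
```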

There are some test cases in {{TestKuromojiKatakanaStemFilter}}, and I've also 
added a case to {{TestKuromojiAnalyzer}} that demonstrates how the stemming 
works in combination with katakana compound splitting.

In Japanese, "manager" can be written both as マネージャー and マネージャ (and probably 
also マネジャー).  For the compound シニアプロジェクトマネージャー (senior project manager), we 
now get the tokens シニア (senior) プロジェクト (project) マネージャ (manager), where the 
last token has been stemmed by removing the trailing ー.  Kuromoji also makes the 
compound シニアプロジェクトマネージャ a synonym of シニア, and ー is removed for the synonym 
compound as well.

Tests pass and I've also tested this end-to-end in a Solr trunk build.
                
> Add katakana stem filter to better deal with certain katakana spelling 
> variants
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-3901
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3901
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Christian Moen
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3901.patch, LUCENE-3901.patch
>
>
> Many Japanese katakana words end in a long sound that is sometimes optional.
> For example, パーティー and パーティ are both perfectly valid for "party".  Similarly 
> we have センター and センタ that are variants of "center" as well as サーバー and サーバ 
> for "server".
> I'm proposing that we add a katakana stemmer that removes this long sound if 
> the terms are longer than a configurable length.  It's also possible to add 
> the variant as a synonym, but I think stemming is preferred from a ranking 
> point of view.



