[ https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194413#comment-13194413 ]
Robert Muir commented on SOLR-3056:
-----------------------------------
I opened LUCENE-3726 for the search mode discussion.
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
> Key: SOLR-3056
> URL: https://issues.apache.org/jira/browse/SOLR-3056
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Affects Versions: 3.6, 4.0
> Reporter: Christian Moen
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again
> Robert, Uwe and Simon). It would be very good to get a default field type
> defined for Japanese in {{schema.xml}} so we can have good out-of-the-box
> Japanese support in Solr.
> I've been playing with the below configuration today, which I think is a
> reasonable starting point for Japanese. There's a lot to be said about various
> considerations necessary when searching Japanese, but perhaps a wiki page is
> more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and
> its analyzers need to be seen by the Solr classloader. However, these are
> currently in contrib and I'm wondering if we should consider moving them to
> core to make them directly available. If there are concerns with additional
> memory usage, etc. for non-Japanese users, we can make sure resources are
> loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type suitable for Japanese text, using morphological
>      analysis.
>      NOTE: Please copy the files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (e.g. example/solr/lib) before
>      starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator
>      to AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (in this file). Use "OR" if you would like to optimize for recall
>      (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
>            autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer.
>          Use search mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際
>            (International) 空港 (airport)
>          so we get a match for 空港 (airport), as we would expect from a good
>          search engine.
>          Valid values for mode are:
>            normal: default segmentation
>            search: segmentation useful for search (extra compound splitting)
>            extended: search mode with unigramming of unknown words
>                      (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of
>          part-of-speech accuracy.
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary
>          forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain parts of speech
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
>             tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to
>          full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
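> As a rough sketch of how the type might then be used (the field name
> {{body_ja}} is hypothetical and not part of this proposal), a field in
> {{schema.xml}} could reference {{text_ja}} like this:
> {code:xml}
> <!-- Hypothetical field; the name body_ja is an assumption for illustration -->
> <field name="body_ja" type="text_ja" indexed="true" stored="true"/>
> {code}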