Introduce Japanese field type in schema.xml
-------------------------------------------
Key: SOLR-3056
URL: https://issues.apache.org/jira/browse/SOLR-3056
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Affects Versions: 3.6, 4.0
Reporter: Christian Moen
Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again
Robert, Uwe and Simon). It would be very good to get a default field type
defined for Japanese in {{schema.xml}} so we can get good Japanese out-of-the-box
support in Solr.
I've been playing with the below configuration today, which I think is a
reasonable starting point for Japanese. There's a lot to be said about various
considerations necessary when searching Japanese, but perhaps a wiki page is
more suitable to cover the wider topic?
In order to make the below {{text_ja}} field type work, Kuromoji itself and its
analyzers need to be seen by the Solr classloader. However, these are
currently in contrib and I'm wondering if we should consider moving them to
core to make them directly available. If there are concerns with additional
memory usage, etc. for non-Japanese users, we can make sure resources are
loaded lazily and only when needed in factory-land.
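Until that is decided, an alternative to copying the jars might be to point the Solr classloader at them with {{<lib>}} directives in {{solrconfig.xml}}. This is only a rough sketch: the relative paths assume the stock example layout and are an assumption that would need adjusting for other setups.
{code:xml}
<!-- Assumed paths, relative to the core's instance directory in the example layout -->
<lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-kuromoji-.*\.jar"/>
<lib dir="../../dist/" regex="apache-solr-analysis-extras-\d.*\.jar"/>
{code}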
Any thoughts?
{code:xml}
<!-- Text field type suitable for Japanese text, using morphological analysis
NOTE: Please copy files
contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
dist/apache-solr-analysis-extras-x.y.z.jar
to your Solr lib directory (i.e. example/solr/lib) before starting
Solr.
(x.y.z refers to a version number)
If you would like to optimize for precision, set the default operator to AND with
<solrQueryParser defaultOperator="AND"/>
below (in this file). Use "OR" if you would like to optimize for recall
(default).
-->
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
autoGeneratePhraseQueries="false">
<analyzer>
<!-- Kuromoji Japanese morphological analyzer/tokenizer
Use search-mode to get a noun-decompounding effect useful for search.
Example:
関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際
(International) 空港 (airport)
so we get a match for 空港 (airport) as we would expect from a good
search engine
Valid values for mode are:
normal: default segmentation
search: segmentation useful for search (extra compound splitting)
extended: search mode with unigramming of unknown words (experimental)
NOTE: Search mode improves segmentation for search at the expense of
part-of-speech accuracy
-->
<tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
<!-- Reduces inflected verbs and adjectives to their base/dictionary forms
(辞書形) -->
<filter class="solr.KuromojiBaseFormFilterFactory"/>
<!-- Optionally remove tokens with certain parts of speech
<filter class="solr.KuromojiPartOfSpeechStopFilterFactory"
tags="stopTags.txt" enablePositionIncrements="true"/> -->
<!-- Normalizes full-width romaji to half-width and half-width kana to
full-width (Unicode NFKC subset) -->
<filter class="solr.CJKWidthFilterFactory"/>
<!-- Lower-case romaji characters -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
{code}
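Once the field type is in place, a field can reference it by name. A minimal sketch; the field name {{body_ja}} is hypothetical and only there for illustration:
{code:xml}
<!-- Hypothetical field that uses the text_ja field type defined above -->
<field name="body_ja" type="text_ja" indexed="true" stored="true"/>
{code}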