[jira] [Commented] (SOLR-3056) Introduce Japanese field type in schema.xml

Robert Muir (Commented) (JIRA) Sun, 22 Jan 2012 11:37:05 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-3056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190747#comment-13190747
 ]


Robert Muir commented on SOLR-3056:
-----------------------------------

{quote}
It would be very good to get a default field type defined for Japanese in 
schema.xml so we can get good Japanese out-of-the-box support in Solr.
{quote}

I agree. We really need this for all languages, including stopwords_xx files 
and fieldtypes,
but let's start with Japanese because it's complicated.

{quote}
I've been playing with the below configuration today, which I think is a 
reasonable starting point for Japanese. There's a lot to be said about various 
considerations necessary when searching Japanese, but perhaps a wiki page is 
more suitable to cover the wider topic?
{quote}

I think the ideal situation would be to have a single reasonable default (like 
the configuration you have), but then also a 
full wiki page on Kuromoji explaining the different options, maybe even with 
alternative configurations or examples. We could
link to this page from the other wiki pages about the analyzers.

{quote}
In order to make the below text_ja field type work, Kuromoji itself and its 
analyzers need to be seen by the Solr classloader. However, these are currently 
in contrib and I'm wondering if we should consider moving them to core to make 
them directly available. If there are concerns with additional memory usage, 
etc. for non-Japanese users, we can make sure resources are loaded lazily and 
only when needed in factory-land.
{quote}

Yeah, I don't think having Kuromoji in contrib is ideal. I think instead we 
should have examples for all supported languages
so it's easy to get started. Currently someone has to jump through serious hoops to 
segment Chinese or Japanese into words,
but as I mentioned before, all non-English languages are currently 'hard' in 
that there are no fieldtypes set up for them.

But anyway, my vote is to move these analyzers to core and nuke this contrib 
totally. It would be great for some
people to speak up so we get consensus on this, because it would only be more 
confusing to go back and forth between
contrib and core.


As far as the default configuration goes,

Christian, maybe if you have some time you could review the stopTags.txt 
we have in the analyzer right now? 

http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/kuromoji/src/resources/org/apache/lucene/analysis/kuromoji/stoptags.txt?view=markup

I created this file from the ipadic manual (there could be, and likely are, silly errors 
too), in an attempt to also document the POS tagset.
But we should also see if the uncommented POS tags in that file are appropriate 
for a 'good stop set'. I think I just arbitrarily
picked a few, trying to be conservative.
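
For reference, a minimal sketch of what the field type would look like with 
the stop-tags filter switched on. It reuses the factory names and attributes 
from the configuration in the issue description; the assumption here is that a 
file named stopTags.txt has been copied next to schema.xml in the Solr conf 
directory, which is not a settled convention.

{code:xml}
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
    <filter class="solr.KuromojiBaseFormFilterFactory"/>
    <!-- Assumption: stopTags.txt sits in the same conf directory as schema.xml -->
    <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
{code}

Whether this filter should be on by default depends on how conservative the 
uncommented tags in the file turn out to be.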


                
> Introduce Japanese field type in schema.xml
> -------------------------------------------
>
>                 Key: SOLR-3056
>                 URL: https://issues.apache.org/jira/browse/SOLR-3056
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>
> Kuromoji (LUCENE-3305) is now on both trunk and branch_3x (thanks again 
> Robert, Uwe and Simon). It would be very good to get a default field type 
> defined for Japanese in {{schema.xml}} so we can get good Japanese out-of-the-box 
> support in Solr.
> I've been playing with the below configuration today, which I think is a 
> reasonable starting point for Japanese.  There's a lot to be said about various 
> considerations necessary when searching Japanese, but perhaps a wiki page is 
> more suitable to cover the wider topic?
> In order to make the below {{text_ja}} field type work, Kuromoji itself and 
> its analyzers need to be seen by the Solr classloader.  However, these are 
> currently in contrib and I'm wondering if we should consider moving them to 
> core to make them directly available.  If there are concerns with additional 
> memory usage, etc. for non-Japanese users, we can make sure resources are 
> loaded lazily and only when needed in factory-land.
> Any thoughts?
> {code:xml}
> <!-- Text field type suitable for Japanese text using morphological analysis
>      NOTE: Please copy the files
>        contrib/analysis-extras/lucene-libs/lucene-kuromoji-x.y.z.jar
>        dist/apache-solr-analysis-extras-x.y.z.jar
>      to your Solr lib directory (i.e. example/solr/lib) before starting Solr.
>      (x.y.z refers to a version number)
>      If you would like to optimize for precision, set the default operator to AND with
>        <solrQueryParser defaultOperator="AND"/>
>      below (in this file).  Use "OR" if you would like to optimize for recall (default).
> -->
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" 
> autoGeneratePhraseQueries="false">
>   <analyzer>
>     <!-- Kuromoji Japanese morphological analyzer/tokenizer
>          Use search-mode to get a noun-decompounding effect useful for search.
>          Example:
>            関西国際空港 (Kansai International Airport) becomes 関西 (Kansai) 国際 (International) 空港 (airport)
>            so we get a match for 空港 (airport) as we would expect from a good search engine
>          Valid values for mode are:
>             normal: default segmentation
>             search: segmentation useful for search (extra compound splitting)
>           extended: search mode with unigramming of unknown words 
> (experimental)
>          NOTE: Search mode improves segmentation for search at the expense of 
> part-of-speech accuracy
>     -->
>     <tokenizer class="solr.KuromojiTokenizerFactory" mode="search"/>
>     <!-- Reduces inflected verbs and adjectives to their base/dictionary forms (辞書形) -->
>     <filter class="solr.KuromojiBaseFormFilterFactory"/>
>     <!-- Optionally remove tokens with certain parts of speech
>     <filter class="solr.KuromojiPartOfSpeechStopFilterFactory" tags="stopTags.txt" enablePositionIncrements="true"/> -->
>     <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Lower-case romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
