[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018114#comment-17018114
 ] 

Michael McCandless commented on LUCENE-9123:
--------------------------------------------

Would this solution make Kuromoji emit a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?

Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter), since that seems to be where the error is 
occurring here?  Or would you also use it as your primary Tokenizer (which 
would mean you no longer get compound words directly out of Kuromoji)?

Net/net it's disappointing that neither synonym filter nor synonym graph filter 
can correctly consume an incoming token graph; fixing that would be somewhat 
tricky, but it is important.  I thought we had a dedicated issue for that but I 
cannot locate it now.
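
To make the position-increment constraint above concrete, here is a minimal 
sketch in plain Java with no Lucene dependency.  Everything in it is an 
assumption made for illustration: the Token class, requireSimpleChain, and the 
two hard-coded token lists are hypothetical stand-ins, not Lucene's actual API 
or the code in the attached patches.  The search-mode list assumes the compound 
token is emitted at the same position as its first part (position increment 0), 
which is what the error message in this issue suggests.

```java
import java.util.List;

/** Simplified model of a token stream; these are not Lucene's real classes. */
final class SynonymChainCheck {

    /** A token plus the two attributes that describe its place in a token graph. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 = starts at the same position as the previous token
        final int positionLength;    // number of positions the token spans

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /** Rejects any stream that is not a simple chain (every increment must be 1). */
    static void requireSimpleChain(String input, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                throw new IllegalArgumentException(
                    "term: " + input + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: "
                        + t.positionIncrement + ")");
            }
        }
    }

    /** Assumed mode=search output: the compound overlaps its decomposed parts. */
    static List<Token> searchModeTokens() {
        return List.of(
            new Token("株式", 1, 1),
            new Token("株式会社", 0, 2),
            new Token("会社", 1, 1));
    }

    /** Assumed mode=normal output: a single token, so a simple chain. */
    static List<Token> normalModeTokens() {
        return List.of(new Token("株式会社", 1, 1));
    }

    public static void main(String[] args) {
        // The single-token chain passes the check silently.
        requireSimpleChain("株式会社", normalModeTokens());
        try {
            // The overlapping compound token is rejected.
            requireSimpleChain("株式会社", searchModeTokens());
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under these assumptions the compound token is the one that trips the check, 
which is why a mode that emits only a simple chain would let the synonym rules 
parse, at the cost of losing the compound token from that stream.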

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. With `mode=normal` it works fine. However, we would like 
> to use decomposed tokens, which maximize the chance of increasing recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                     synonyms="lang/synonyms_ja.txt"
>                     tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> A synonym entry that triggers the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
