[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018470#comment-17018470 ]
Kazuaki Hiraga commented on LUCENE-9123:
----------------------------------------

{quote}
I thought the change in the behavior would have very little or no impact for users who use the Tokenizer for searching, but yes, it would affect users who use it purely for tokenization.
{quote}
Yes, I think the point is that changing the default behavior affects Solr/Elastic users as well if these products don't change the parameter. But you may be right that we can change the default behavior. I have no idea...

{quote}
How about this proposal: we can create two patches, one for the master and one for 8x. On the 8x branch, add the new constructor so you can use it from the next update. There is no change in the default behavior. On the master branch, switch the default behavior (users who don't like the change can still switch back by using the full constructor).
{quote}
OK. I will prepare another patch for the master branch. Then whoever maintains the Japanese Tokenizer can choose how to merge the changes (who is responsible for the Japanese Tokenizer for now?)

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with
> either SynonymGraphFilter or SynonymFilter when JT generates multiple
> tokens as an output. If we use `mode=normal`, it should be fine. However, we
> would like to use decomposed tokens, which can maximize the chance of
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_custom_ja" class="solr.TextField"
>            positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory"
>             synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <!-- Removes tokens with certain part-of-speech tags -->
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
>             tags="lang/stoptags_ja.txt" />
>     <!-- Normalizes full-width romaji to half-width and half-width kana
>          to full-width (Unicode NFKC subset) -->
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <!-- Removes common tokens typically not useful for search, but have
>          a negative effect on ranking -->
>     <!-- <filter class="solr.StopFilterFactory" ignoreCase="true"
>          words="lang/stopwords_ja.txt" /> -->
>     <!-- Normalizes common katakana spelling variations by removing any
>          last long sound character (U+30FC) -->
>     <filter class="solr.JapaneseKatakanaStemFilterFactory"
>             minimumLength="4"/>
>     <!-- Lower-cases romaji characters -->
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3]
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1
> (got: 0)
> {noformat}
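
For readers following the thread, here is a minimal, self-contained sketch of the kind of check that appears to produce the error above. It does not use Lucene; the `Token` class and `firstViolation` method are hypothetical stand-ins, and the token sequences are assumptions inferred from the error message (in search mode the compound 株式会社 is stacked with position increment 0 on top of the decomposed tokens, while normal mode is assumed to emit one token).

```java
import java.util.Arrays;
import java.util.List;

public class SynonymPositionCheck {

    /** Hypothetical stand-in for an analyzed token: surface form plus position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Returns null when every token advances by exactly one position.
     * Otherwise returns a message modeled on the one in the console output.
     */
    static String firstViolation(String input, List<Token> tokens) {
        for (Token t : tokens) {
            if (t.positionIncrement != 1) {
                return "term: " + input + " analyzed to a token (" + t.term
                        + ") with position increment != 1 (got: " + t.positionIncrement + ")";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumed search-mode output: decomposed tokens plus the compound at the same position.
        List<Token> searchMode = Arrays.asList(
                new Token("株式", 1), new Token("株式会社", 0), new Token("会社", 1));
        // Assumed normal-mode output: the compound as a single token.
        List<Token> normalMode = Arrays.asList(new Token("株式会社", 1));

        System.out.println(firstViolation("株式会社", searchMode));
        System.out.println(firstViolation("株式会社", normalMode));
    }
}
```

Under these assumptions the search-mode sequence is rejected at the stacked compound token, while the normal-mode sequence passes, which matches the behavior described in the issue.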