[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

Kazuaki Hiraga (Jira) Fri, 17 Jan 2020 18:09:22 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018470#comment-17018470
 ]


Kazuaki Hiraga commented on LUCENE-9123:
----------------------------------------

{quote}
I thought the change in the behavior has very small or no impact for users who 
use the Tokenizer for searching, but yes it would affect users who use it for 
pure tokenization purpose.
{quote}

 Yes, I think that's a point that changing default behavior affects 
Solr/Elastic users as well if these products doesn't change the parameter.  But 
you may be right that we can change the default behavior. I have no idea...

{quote}
How about this proposal: we can create two patches, one for the master and one 
for 8x. On 8x branch, add the new constructor so you can use it from the next 
update. There is no change in the default behavior. On the master branch, 
switch the default behavior (users who don't like the change can still swich 
back by using the full constructor).
{quote}

OK. I will prepare another patch for the master branch. Then, a person who is a 
maintainer of Japanese Tokenizer can choose how to merge the changes (who is 
responsible for Japanese Tokenizer for now?)

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> both of SynonymGraphFilter and SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize to chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField" 
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                     synonyms="lang/synonyms_ja.txt"
>                     tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory" 
> tags="lang/stoptags_ja.txt" />
>         <!-- Normalizes full-width romaji to half-width and half-width kana 
> to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have 
> a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="lang/stopwords_ja.txt" /> -->
>         <!-- Normalizes common katakana spelling variations by removing any 
> last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory" 
> minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> An synonym entry that generates error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

Reply via email to