[ https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022865#comment-17022865 ]

Alan Woodward commented on LUCENE-9123:
---------------------------------------

> It's just a rather tricky fix, though I think we added a base class to make
> working with incoming graph tokens easier, which could make that fix also
> easier.

This is GraphTokenStream, which does indeed make consuming graphs easier.  The 
problem isn't going to be working with the input, though, it's going to be 
dealing with output.  Currently there is no good way to insert arbitrary graphs 
into a token stream that already contains multiple paths without breaking those 
existing paths.
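
For reference, here is a minimal sketch of what consuming the input looks like, assuming the base class in question is org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings (the helper QueryBuilder uses at query time). Enumerating the paths of an incoming graph is straightforward; it's re-emitting a modified graph that has no good API:

{code:java}
import java.io.IOException;
import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.graph.GraphTokenStreamFiniteStrings;

public class GraphPaths {
  /** Prints every path through the incoming token graph, one path per line. */
  public static void printPaths(TokenStream in) throws IOException {
    GraphTokenStreamFiniteStrings graph = new GraphTokenStreamFiniteStrings(in);
    Iterator<TokenStream> paths = graph.getFiniteStrings();
    while (paths.hasNext()) {
      try (TokenStream path = paths.next()) {
        CharTermAttribute term = path.addAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        path.reset();
        while (path.incrementToken()) {
          sb.append(term).append(' ');
        }
        path.end();
        System.out.println(sb.toString().trim());
      }
    }
  }
}
{code}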

Let's say we have an incoming stream with a token ABC that has been 
decompounded into 'A B C', followed by a token D:

ABC(1, 3) A(0, 1) B(1, 1) C(1, 1) D(1, 1)

Here ABC is the term, and the numbers in brackets are the position increment 
and position length of the token.  Note that 'ABC' has a position length of 
three, which, if followed, points to the token 'D'.
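
To make the encoding concrete, here is a minimal sketch that builds exactly this stream with the test framework's CannedTokenStream (the class name and the character offsets are invented for illustration):

{code:java}
import org.apache.lucene.analysis.CannedTokenStream;
import org.apache.lucene.analysis.Token;

public class DecompoundedStream {
  /** Builds term(posInc, posLen) with illustrative offsets. */
  static Token tok(String term, int posInc, int posLen, int start, int end) {
    Token t = new Token(term, start, end);
    t.setPositionIncrement(posInc);
    t.setPositionLength(posLen);
    return t;
  }

  public static CannedTokenStream stream() {
    return new CannedTokenStream(
        tok("ABC", 1, 3, 0, 3), // the compound spans the three positions below
        tok("A",   0, 1, 0, 1), // starts at the same position as ABC
        tok("B",   1, 1, 1, 2),
        tok("C",   1, 1, 2, 3),
        tok("D",   1, 1, 4, 5)); // the token ABC's position length points to
  }
}
{code}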

We now insert a multi-term synonym for B = B Q Q:

ABC(1, 3) A(0, 1) B(1, 3) B(0, 1) Q(1, 1) Q(1, 1) C(1, 1) D(1, 1)

Because of the extra tokens inserted into the 'A B C' branch of the existing 
graph, the position length of the 'ABC' term now points into the middle of 
the inserted synonym path instead of at 'D'.  But by the time we've reached 
'B' in the token stream, 'ABC' has already been emitted, so we can't go back 
and adjust its position length.
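
The breakage is plain position arithmetic: a token's absolute position is the running sum of increments, and its span ends posLen positions later. A self-contained sketch of the two streams above:

{code:java}
public class PosLenDemo {
  static final class Tok {
    final String term; final int posInc; final int posLen;
    Tok(String term, int posInc, int posLen) {
      this.term = term; this.posInc = posInc; this.posLen = posLen;
    }
  }

  /** Prints each token's absolute position and where its span ends. */
  static void show(Tok... stream) {
    int pos = -1; // the first increment of 1 moves us to position 0
    for (Tok t : stream) {
      pos += t.posInc;
      System.out.printf("%s: starts at %d, span ends at %d%n",
          t.term, pos, pos + t.posLen);
    }
    System.out.println();
  }

  public static void main(String[] args) {
    // Before the synonym: ABC spans [0,3), and D starts at position 3.
    show(new Tok("ABC", 1, 3), new Tok("A", 0, 1), new Tok("B", 1, 1),
         new Tok("C", 1, 1), new Tok("D", 1, 1));
    // After inserting B = B Q Q: ABC still claims to end at position 3,
    // but position 3 now falls inside the synonym path and D sits at 5.
    show(new Tok("ABC", 1, 3), new Tok("A", 0, 1), new Tok("B", 1, 3),
         new Tok("B", 0, 1), new Tok("Q", 1, 1), new Tok("Q", 1, 1),
         new Tok("C", 1, 1), new Tok("D", 1, 1));
  }
}
{code}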

I think the solution is going to have to involve changes to the TokenStream API 
at query time; the current posinc+poslen encoding isn't flexible enough, and 
breaks in confusing ways when you try to modify things mid-stream.  What we do 
at index time is even trickier, because we can't encode graphs in the index 
without making things like phrase queries or interval queries much more complex 
(and probably a lot slower).  We could require graph token filters to be 
applied only at query time, but that leaves the question of how to deal with 
decompounding filters.
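
For comparison, the status quo at index time is to squash the graph rather than encode it: a graph-producing filter has to be followed by FlattenGraphFilter before indexing. A minimal sketch of such a chain (the synonym map construction is omitted, and the tokenizer choice is arbitrary):

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.FlattenGraphFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.synonym.SynonymGraphFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;

public class IndexTimeChain {
  /** Index-time analyzer: the graph must be flattened before it hits the index. */
  public static Analyzer indexAnalyzer(SynonymMap synonyms) {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream sink = new SynonymGraphFilter(source, synonyms, true);
        sink = new FlattenGraphFilter(sink); // lossy: side paths are squashed
        return new TokenStreamComponents(source, sink);
      }
    };
  }
}
{code}

Flattening keeps all the tokens but compresses the position lengths, so the graph structure itself never reaches the index; that lossiness is exactly what makes decompounding filters awkward to handle this way.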

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9123
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9123
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 8.4
>            Reporter: Kazuaki Hiraga
>            Assignee: Tomoko Uchida
>            Priority: Major
>         Attachments: LUCENE-9123.patch, LUCENE-9123_8x.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when the tokenizer generates 
> multiple tokens as output. If we use `mode=normal`, it should be fine. 
> However, we would like to use decomposed tokens to maximize the chance of 
> increasing recall.
> Snippet of schema:
> {code:xml}
>     <fieldType name="text_custom_ja" class="solr.TextField"
>                positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer>
>         <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>         <filter class="solr.SynonymGraphFilterFactory"
>                 synonyms="lang/synonyms_ja.txt"
>                 tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>         <filter class="solr.JapaneseBaseFormFilterFactory"/>
>         <!-- Removes tokens with certain part-of-speech tags -->
>         <filter class="solr.JapanesePartOfSpeechStopFilterFactory"
>                 tags="lang/stoptags_ja.txt"/>
>         <!-- Normalizes full-width romaji to half-width and half-width kana
>              to full-width (Unicode NFKC subset) -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- Removes common tokens typically not useful for search, but have
>              a negative effect on ranking -->
>         <!-- <filter class="solr.StopFilterFactory" ignoreCase="true"
>                      words="lang/stopwords_ja.txt"/> -->
>         <!-- Normalizes common katakana spelling variations by removing any
>              last long sound character (U+30FC) -->
>         <filter class="solr.JapaneseKatakanaStemFilterFactory"
>                 minimumLength="4"/>
>         <!-- Lower-cases romaji characters -->
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}
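
A minimal sketch that shows where the rejected token comes from (assuming the bundled Kuromoji dictionary): the synonym parser analyzes the rule text with the configured tokenizer and, per the error above, rejects any token whose position increment isn't 1. Printing the SEARCH-mode output for the rule text makes the posInc=0 token visible:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class ShowSearchModeTokens {
  public static void main(String[] args) throws Exception {
    // SEARCH mode decompounds 株式会社 and also emits the compound itself,
    // so one of the tokens arrives with position increment 0 -- which is
    // what the synonym parser rejects when it analyzes the rule text.
    try (JapaneseTokenizer tok =
        new JapaneseTokenizer(null, true, JapaneseTokenizer.Mode.SEARCH)) {
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      PositionIncrementAttribute posInc = tok.addAttribute(PositionIncrementAttribute.class);
      PositionLengthAttribute posLen = tok.addAttribute(PositionLengthAttribute.class);
      tok.setReader(new StringReader("株式会社"));
      tok.reset();
      while (tok.incrementToken()) {
        System.out.printf("%s(%d, %d)%n",
            term, posInc.getPositionIncrement(), posLen.getPositionLength());
      }
      tok.end();
    }
  }
}
{code}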


