[ 
https://issues.apache.org/jira/browse/LUCENE-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17417411#comment-17417411
 ] 

Julie Tibshirani commented on LUCENE-10102:
-------------------------------------------

I noticed a randomized test failure pop up. Here's an example reproduction line 
and stack trace:
{code:java}
./gradlew test --tests TestFactories.test -Dtests.seed=4F3A8742C547BA6A \
  -Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true \
  -Dtests.locale=shi-Tfng-MA -Dtests.timezone=Asia/Jerusalem -Dtests.asserts=true \
  -Dtests.file.encoding=UTF-8
{code}
{code:java}
    java.lang.IllegalStateException: incrementToken() called while in wrong state: INCREMENT_FALSE
        at __randomizedtesting.SeedInfo.seed([4F3A8742C547BA6A:C76EB8986BBBD792]:0)
        at org.apache.lucene.analysis.MockTokenizer.fail(MockTokenizer.java:135)
        at org.apache.lucene.analysis.MockTokenizer.incrementToken(MockTokenizer.java:146)
        at org.apache.lucene.analysis.ja.JapaneseCompletionFilter.mayIncrementToken(JapaneseCompletionFilter.java:114)
        ...
{code}
 

> Add JapaneseCompletionFilter for Input Method-aware auto-completion
> -------------------------------------------------------------------
>
>                 Key: LUCENE-10102
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10102
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Assignee: Tomoko Uchida
>            Priority: Major
>             Fix For: main (9.0)
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> +Basic background information+
> As you know, Japanese text is written in Kanji (ideograms), Katakana and
> Hiragana (phonetic scripts), and combinations of them. An intelligent
> auto-completion system therefore needs to handle all of these
> representations. One common practice is to translate all inputs to a
> "romanized form" ([https://en.wikipedia.org/wiki/Romanization_of_Japanese])
> and thereby reduce the problem to simple Latin-alphabet string matching.
>  For example, if the word "桜" (surface form) is given, we first convert it
> to "サクラ" (reading form) and then translate it to "sakura" (romanized
> form), so that we can suggest the auto-complete keyword "桜" for an
> incomplete query "さ", "サ", or "sa".
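
As a rough illustration of the surface form, reading form, romanized form idea quoted above, the following sketch normalizes a query typed in hiragana, katakana, or Latin letters to a romanized string and prefix-matches it against a romanized suggestion. It is only a toy under stated assumptions: the class RomanizedPrefixMatch, its three-entry kana table, and the hard-coded "sakura" suggestion are hypothetical and are not taken from the JapaneseCompletionFilter patch. Non-ASCII characters are written as Unicode escapes so the source stays ASCII-only.

```java
import java.util.Map;

/** Toy sketch: reduce kana/Latin queries to romaji, then prefix-match. */
final class RomanizedPrefixMatch {

  // Hypothetical three-entry katakana-to-romaji table (sa, ku, ra).
  private static final Map<Character, String> KANA_TO_ROMAJI =
      Map.of('\u30b5', "sa", '\u30af', "ku", '\u30e9', "ra");

  /** Shifts hiragana (U+3041..U+3096) to the matching katakana code point. */
  static String toKatakana(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
      sb.append(c >= 0x3041 && c <= 0x3096 ? (char) (c + 0x60) : c);
    }
    return sb.toString();
  }

  /** Romanizes known kana; any other character (e.g. Latin) passes through lowercased. */
  static String romanize(String query) {
    StringBuilder sb = new StringBuilder();
    for (char c : toKatakana(query).toCharArray()) {
      String romaji = KANA_TO_ROMAJI.get(c);
      sb.append(romaji != null ? romaji : String.valueOf(Character.toLowerCase(c)));
    }
    return sb.toString();
  }

  /** True if the romanized query is a prefix of the suggestion's romanized form. */
  static boolean matches(String query, String suggestionRomaji) {
    return suggestionRomaji.startsWith(romanize(query));
  }

  public static void main(String[] args) {
    String sakura = "sakura"; // romanized form of the suggestion keyword U+685C
    // Hiragana "sa" (U+3055), katakana "sa" (U+30B5), and Latin "sa".
    for (String query : new String[] {"\u3055", "\u30b5", "sa"}) {
      System.out.println(romanize(query) + " -> " + matches(query, sakura));
    }
  }
}
```

All three query spellings reduce to the same Latin string here, which is what lets a single Latin-alphabet prefix match serve every script.
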
>  
> +The difficulties+
>  A simplistic approach to implementing such romanization-based
> auto-completion is to use JapaneseReadingFormFilter (which has a "useRomaji"
> option). Unfortunately, this off-the-shelf method doesn't work, not through
> any fault of its own but because of the complex interplay of multiple
> romanization systems and IMEs ([https://en.wikipedia.org/wiki/Input_method]).
> It is a little difficult for me to explain their detailed specifications in
> English, but let me provide some examples.
> 1) Multiple romanization systems
>  There are three major romanization systems: modified Hepburn-shiki,
> Kunrei-shiki (Nihon-shiki), and Wāpuro-shiki. JapaneseReadingFormFilter
> supports only modified Hepburn-shiki, so it cannot cover all possible
> romanized forms.
>  E.g., "新橋" can be translated into eight romanized forms (in theory):
> "sinbasi", "shinbasi", "sinnbasi", "shinnbasi", "sinbashi", "shinbashi",
> "sinnbashi", and "shinnbashi".
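
The eight forms listed above are the product of per-mora alternatives (2 x 2 x 1 x 2 = 8). A minimal sketch of that expansion follows, assuming the alternatives for each mora are supplied by hand. The class RomajiVariants and the SHINBASHI table are hypothetical names for illustration; the actual filter may derive and combine variants quite differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy expansion of per-mora romanization alternatives into full strings. */
final class RomajiVariants {

  // Hand-written alternatives for the four morae of the reading shi-n-ba-shi
  // (an assumption for this sketch, not data from the patch).
  static final List<List<String>> SHINBASHI =
      List.of(
          List.of("si", "shi"), // first mora: Kunrei-style vs. Hepburn-style
          List.of("n", "nn"), // moraic n: single vs. doubled as typed on a keyboard
          List.of("ba"),
          List.of("si", "shi"));

  /**
   * Builds every concatenation that picks one alternative per segment.
   * Earlier segments vary fastest in the output order.
   */
  static List<String> expand(List<List<String>> segments) {
    List<String> results = new ArrayList<>();
    results.add("");
    for (List<String> alternatives : segments) {
      List<String> next = new ArrayList<>();
      for (String alt : alternatives) {
        for (String prefix : results) {
          next.add(prefix + alt);
        }
      }
      results = next;
    }
    return results;
  }

  public static void main(String[] args) {
    // Prints the eight candidate romanizations, one per line.
    expand(SHINBASHI).forEach(System.out::println);
  }
}
```

Because the count is a product over morae, longer readings can expand into many forms, which is one reason a suggester would want to represent them compactly rather than as a flat list.
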
> 2) Interaction with input methods
>  While a user is typing a query, strings that are still in mid-IME
> composition are sent to the search system, and an auto-complete system should
> handle them (it could simply ignore such inputs, but that hurts the user
> experience).
>  E.g., "会sy" can be an input to an auto-completion system. If we have a
> method to translate it to "kaisy", we can suggest "会社" (kaisya).
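
To make the "会sy" example concrete, here is a minimal sketch that romanizes the already-converted kanji and passes the still-unconverted Latin tail through unchanged, then prefix-matches against "kaisya". It assumes a hypothetical one-entry, per-character reading table; a real system would obtain readings per token from a morphological analyzer, and the class MidCompositionQuery is not part of the patch. The kanji is written as a Unicode escape so the source stays ASCII-only.

```java
import java.util.Map;

/** Toy sketch: romanize a query that mixes converted kanji with unconverted Latin input. */
final class MidCompositionQuery {

  // Hypothetical reading table with a single entry: U+4F1A -> "kai".
  private static final Map<Character, String> READINGS = Map.of('\u4f1a', "kai");

  /** Replaces characters with a known reading; everything else passes through lowercased. */
  static String romanize(String input) {
    StringBuilder sb = new StringBuilder();
    for (char c : input.toCharArray()) {
      String reading = READINGS.get(c);
      sb.append(reading != null ? reading : String.valueOf(Character.toLowerCase(c)));
    }
    return sb.toString();
  }

  /** True if the romanized query is a prefix of the candidate's romanized form. */
  static boolean matches(String query, String candidateRomaji) {
    return candidateRomaji.startsWith(romanize(query));
  }

  public static void main(String[] args) {
    String query = "\u4f1asy"; // U+4F1A followed by the half-typed "sy"
    System.out.println(romanize(query) + " -> " + matches(query, "kaisya"));
  }
}
```
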
>  
> +Solution+
>  I implemented a token filter (and added an analyzer for ease of use) that
> handles those two challenges. With this filter, we can use
> AnalyzingSuggester for fast automaton-based auto-completion for Japanese.
>  (I acknowledge that it contains some peculiar logic, but I suppose that is
> necessary complexity for a tool that deals with the intricacies of natural
> language systems...)
>  
> +Note+
>  * The filter has worked well for us for one year on a production system
> with a moderate-sized base of business users (1000~), and I've fixed some
> weird bugs we've encountered so far. Also, the donation of the code was
> approved by the managers.
>  * There is one missing thing: offset correction. I found that correct offset
> calculation is not required for auto-completion use cases, but I'm trying to
> emit the correct offsets for completeness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
