[jira] [Created] (LUCENE-10102) Add JapaneseCompletionFilter for Input Method-aware auto-completion

Tomoko Uchida (Jira) Tue, 14 Sep 2021 00:34:05 -0700

Tomoko Uchida created LUCENE-10102:
--------------------------------------

             Summary: Add JapaneseCompletionFilter for Input Method-aware auto-completion
                 Key: LUCENE-10102
                 URL: https://issues.apache.org/jira/browse/LUCENE-10102
             Project: Lucene - Core
          Issue Type: Task
          Components: modules/analysis
            Reporter: Tomoko Uchida
            Assignee: Tomoko Uchida



+Basic background information+

As you know, Japanese text is written in Kanji (ideograms), Katakana, Hiragana 
(phonetic symbols), and combinations of them. It is therefore desirable for 
intelligent auto-completion systems to handle all of these representations; 
one common practice is to translate all input to a "romanized form" 
([https://en.wikipedia.org/wiki/Romanization_of_Japanese]) and reduce the 
problem to simple Latin-alphabet string matching.
 For example: if the word "桜" (surface form) is given, we first convert it to 
"サクラ" (reading form) and then translate that to "sakura" (romanized form), so 
that we can suggest the auto-complete keyword "sakura" for an incomplete query 
"sa".
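
 A minimal sketch of the surface form -> reading form -> romanized form chain 
described above, reduced to plain prefix matching. The tiny READINGS and 
KANA_TO_ROMAJI tables and the function names are hypothetical, made up for 
illustration; a real system would take readings from a morphological analyzer 
and would not use a one-character-per-syllable table like this.

```python
# Hypothetical toy lexicon: surface form -> katakana reading form.
READINGS = {"桜": "サクラ", "酒": "サケ", "傘": "カサ"}

# Hypothetical toy table: katakana -> romanized syllable (incomplete on purpose).
KANA_TO_ROMAJI = {"サ": "sa", "ク": "ku", "ラ": "ra", "ケ": "ke", "カ": "ka"}


def romanize(katakana):
    """Translate a katakana reading into its romanized form."""
    return "".join(KANA_TO_ROMAJI[ch] for ch in katakana)


def suggest(prefix):
    """Return surface forms whose romanized reading starts with the prefix."""
    return [
        surface
        for surface, reading in READINGS.items()
        if romanize(reading).startswith(prefix)
    ]


print(romanize("サクラ"))  # sakura
print(suggest("sa"))       # ['桜', '酒']
```

 Once everything is romanized, the incomplete query "sa" matches both "sakura" 
and "sake" by an ordinary string prefix test.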

 

+The difficulties+
 A simplistic approach to implementing such romanization-based auto-completion 
is to use JapaneseReadingFormFilter (which has a "useRomaji" option). 
Unfortunately, this out-of-the-box method doesn't work, not through any fault 
of the filter, but because of the complex combination of multiple romanization 
systems and IMEs. Their detailed specifications are a little difficult for me 
to explain in English, so let me provide some examples.

1) Multiple romanization systems
 There are three major romanization systems: modified Hepburn-shiki, 
Kunrei-shiki (Nihon-shiki), and Wāpuro-shiki. JapaneseReadingFormFilter supports 
only modified Hepburn-shiki, so it isn't sufficient to cover all possible 
romanized forms.
 e.g., "新橋" can be translated into eight romanized forms (in theory): 
"sinbasi", "shinbasi", "sinnbasi", "shinnbasi", "sinbashi", "shinbashi", 
"sinnbashi", and "shinnbashi".
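
 The eight forms above are simply the product of the per-syllable spelling 
choices in the reading "シンバシ" (2 x 2 x 1 x 2). A small sketch of that 
enumeration follows; the VARIANTS table is a hypothetical, deliberately tiny 
mapping that covers only this one example, not the table the filter uses.

```python
from itertools import product

# Hypothetical table: katakana -> spellings accepted across romanization systems.
VARIANTS = {
    "シ": ["si", "shi"],  # Kunrei-shiki vs. Hepburn-shiki
    "ン": ["n", "nn"],    # "nn" is how many IMEs expect the syllabic n to be typed
    "バ": ["ba"],
}


def romanizations(katakana):
    """Enumerate every combination of per-syllable spellings."""
    choices = [VARIANTS[ch] for ch in katakana]
    return ["".join(parts) for parts in product(*choices)]


forms = romanizations("シンバシ")
print(len(forms))  # 8
print(forms)
```

 The number of forms grows multiplicatively with the length of the reading, 
which is one reason a single fixed romanization is not enough for matching.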

2) Interaction with the Input Method
 While the user is typing a query, mid-IME-composition strings are sent to the 
search system, and the auto-completion system should handle them (it could 
simply ignore such input, but that hurts the user experience). 
 e.g., "会ｓｙ" can be an input to an auto-completion system. If we have a method 
to translate it to "kaisy", we can suggest "会社" (kaisya).
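
 A rough sketch of how such a mid-composition string could be reduced to a 
romanized prefix. It assumes the full-width Latin letters can be folded with 
Unicode NFKC normalization and that a reading is available for the part that 
is already converted; the PARTIAL_READINGS and CANDIDATES tables are 
hypothetical stand-ins for a tokenizer and an index, and this is not 
necessarily how the proposed filter does it.

```python
import unicodedata

# Hypothetical: romanized reading of the part the IME has already converted.
PARTIAL_READINGS = {"会": "kai"}

# Hypothetical index: surface form -> accepted romanized forms.
CANDIDATES = {"会社": ["kaisya", "kaisha"], "会議": ["kaigi"]}


def to_romaji_query(raw):
    """Fold full-width letters to ASCII and romanize the converted part."""
    text = unicodedata.normalize("NFKC", raw)  # "ｓｙ" -> "sy"
    return "".join(PARTIAL_READINGS.get(ch, ch) for ch in text).lower()


def suggest_for_composition(raw):
    """Return surface forms with a romanized form starting with the query."""
    query = to_romaji_query(raw)
    return [
        surface
        for surface, forms in CANDIDATES.items()
        if any(form.startswith(query) for form in forms)
    ]


print(to_romaji_query("会ｓｙ"))          # kaisy
print(suggest_for_composition("会ｓｙ"))  # ['会社']
```

 Note that "kaisy" matches only the Kunrei-shiki form "kaisya", not the Hepburn 
form "kaisha", so this case also depends on indexing multiple romanizations.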

 

+Solution+
 I implemented a token filter (and added an analyzer for ease of use) that 
handles those two challenges. With this filter, we can use AnalyzingSuggester 
for fast automaton-based auto-completion for Japanese.
 (I acknowledge that it contains some peculiar logic, but I suppose that is 
necessary complexity for a tool that deals with the intricacies of natural 
language systems...)

 

+Note+
 * The filter has worked well for us for one year on a production system with a 
moderate number of business users (1000~), and I've fixed some weird bugs we've 
encountered so far. Also, my managers have approved the donation of the code.
 * There is one missing piece: offset correction. I found that correct offset 
calculation is not required for auto-completion use cases, but I'm trying to 
emit the correct offsets for completeness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
