[
https://issues.apache.org/jira/browse/LUCENE-10102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415585#comment-17415585
]
Tomoko Uchida commented on LUCENE-10102:
----------------------------------------
I would like to acknowledge the extensive and precise work of the
owners/authors of these sites; I used them as references when implementing
KatakanaRomanizer (the core part).
* [ローマ字 あいうえお|https://green.adam.ne.jp/roomazi/index.html]
* [日本語のローマ字表記の推奨形式 -
東京大学|http://park.itc.u-tokyo.ac.jp/eigo/UT-Komaba-Nihongo-no-romaji-hyoki-v1.pdf]
* [ローマ字の長音のつづり方|http://xembho.s59.xrea.com/siryoo/hikion.html]
* [各種ローマ字表の比較|http://jgrammar.life.coocan.jp/ja/data/rohmaji2.htm]
* [現代かなづかい
(昭和21年内閣告示第33号)|https://ja.wikisource.org/wiki/%E7%8F%BE%E4%BB%A3%E3%81%8B%E3%81%AA%E3%81%A5%E3%81%8B%E3%81%84_(%E6%98%AD%E5%92%8C21%E5%B9%B4%E5%86%85%E9%96%A3%E5%91%8A%E7%A4%BA%E7%AC%AC33%E5%8F%B7)]
> Add JapaneseCompletionFilter for Input Method-aware auto-completion
> -------------------------------------------------------------------
>
> Key: LUCENE-10102
> URL: https://issues.apache.org/jira/browse/LUCENE-10102
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Assignee: Tomoko Uchida
> Priority: Major
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> +Basic background information+
> As you know, Japanese text is written in Kanji (ideograms), Katakana,
> Hiragana (phonetic symbols), and combinations of them. It is therefore
> desirable for intelligent auto-completion systems to handle all of these
> representations. One common practice is to translate all inputs to
> "romanized form" ([https://en.wikipedia.org/wiki/Romanization_of_Japanese])
> and then reduce the problem to simple Latin-alphabet string matching.
> For example: if a word "桜" (surface form) is given, we first convert it to
> "サクラ" (reading form) then further translate it to "sakura" (romanized form)
> so that we can suggest an auto-complete keyword "桜" for an incomplete query
> "さ" or "サ" or "sa".
>
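The surface form -> reading form -> romanized form flow described above can be sketched roughly as follows. This is only an illustration using the plain JDK, not the filter's actual code: the class name `RomanizedPrefixMatch`, its methods, and the three-entry kana table are all assumptions made for the example, and the reading "サクラ" is supplied by hand where a real system would obtain it from a morphological analyzer.

```java
import java.util.Map;

class RomanizedPrefixMatch {
    // Hypothetical, deliberately tiny table: just enough kana to romanize "サクラ".
    static final Map<Character, String> KANA_TO_ROMAJI =
        Map.of('サ', "sa", 'ク', "ku", 'ラ', "ra");

    // Hiragana (U+3041..U+3096) maps to Katakana by adding 0x60 to the code unit.
    static String toKatakana(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(c >= '\u3041' && c <= '\u3096' ? (char) (c + 0x60) : c);
        }
        return sb.toString();
    }

    // Romanizes known kana; any other character (e.g. Latin letters) is kept as-is.
    static String romanize(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : toKatakana(text).toCharArray()) {
            String romaji = KANA_TO_ROMAJI.get(c);
            sb.append(romaji != null ? romaji : String.valueOf(c));
        }
        return sb.toString();
    }

    // True when the romanized query is a prefix of the romanized reading.
    static boolean matches(String reading, String query) {
        return romanize(reading).startsWith(romanize(query));
    }

    public static void main(String[] args) {
        String reading = "サクラ"; // reading form of the surface form "桜"
        for (String query : new String[] {"さ", "サ", "sa"}) {
            System.out.println(query + " matches: " + matches(reading, query));
        }
    }
}
```

With this normalization, the three incomplete queries "さ", "サ" and "sa" all reduce to the same Latin prefix, so each can lead to the suggestion "桜".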
> +The difficulties+
> A simplistic approach to implementing such romanization-based
> auto-completion is to use JapaneseReadingFormFilter (which has a "useRomaji"
> option). Unfortunately, this off-the-shelf method doesn't work, not through
> any fault of its own but because of the complex combinations of multiple
> romanization systems and IMEs
> ([https://en.wikipedia.org/wiki/Input_method]). It is a little difficult for
> me to explain their detailed specifications in English, but let me provide
> some examples.
> 1) Multiple romanization systems
> There are three major romanization systems: modified Hepburn-shiki,
> Kunrei-shiki (Nihon-shiki) and Wāpuro-shiki. JapaneseReadingFormFilter
> supports only modified Hepburn-shiki, so it isn't sufficient to cover all
> possible romanized forms.
> e.g., "新橋" can be translated into eight romanized forms (in theory):
> "sinbasi", "shinbasi", "sinnbasi", "shinnbasi", "sinbashi", "shinbashi",
> "sinnbashi", and "shinnbashi".
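To make the count of eight concrete, here is a minimal sketch that enumerates the variants as a Cartesian product of per-kana alternatives. It is not how KatakanaRomanizer is implemented; the class `RomajiVariants` and its three-entry table are assumptions for illustration, and the reading "シンバシ" of "新橋" is given by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RomajiVariants {
    // Hypothetical table covering only the kana in "シンバシ".
    // Each kana lists the spellings that different romanization systems allow.
    static final Map<Character, List<String>> ALTERNATIVES = Map.of(
        'シ', List.of("si", "shi"),
        'ン', List.of("n", "nn"),
        'バ', List.of("ba"));

    // Expands a katakana reading into every combination of the alternatives.
    static List<String> expand(String katakana) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char kana : katakana.toCharArray()) {
            List<String> options = ALTERNATIVES.get(kana);
            if (options == null) {
                throw new IllegalArgumentException("no table entry for " + kana);
            }
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    next.add(prefix + option);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // 2 (シ) x 2 (ン) x 1 (バ) x 2 (シ) = 8 romanized forms
        System.out.println(expand("シンバシ"));
    }
}
```

The number of variants grows multiplicatively with the number of ambiguous kana, which is one reason an automaton-based representation is attractive compared with indexing every string separately.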
> 2) Interaction with input methods
> When querying, mid-composition IME strings are sent to the search
> system, and an auto-completion system should handle them (it could simply
> ignore such inputs, but that hurts the user experience).
> e.g., "会sy" can be an input to an auto-completion system. If we have a
> method to translate it to "kaisy", we can suggest "会社" (kaisya).
>
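A rough sketch of the "会sy" case follows, assuming the query is split into an already-converted head and a trailing run of Latin letters that the IME has not converted yet. The class `MidCompositionQuery`, the two lookup tables, and the splitting rule are all assumptions for illustration; the actual filter works on analyzer tokens rather than hand-written maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MidCompositionQuery {
    // Hypothetical stand-in for a morphological analyzer:
    // surface form -> romanized reading.
    static final Map<String, String> ROMANIZED_READINGS = Map.of("会", "kai");

    // Hypothetical suggestion index: romanized key -> surface form to suggest.
    static final Map<String, String> INDEX = Map.of(
        "kaisya", "会社",
        "sakura", "桜");

    static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Splits off the trailing Latin letters (the unconverted IME input),
    // romanizes the head, and joins the two parts: "会sy" -> "kaisy".
    static String normalize(String query) {
        int split = query.length();
        while (split > 0 && isAsciiLetter(query.charAt(split - 1))) {
            split--;
        }
        String head = query.substring(0, split);
        String tail = query.substring(split).toLowerCase(Locale.ROOT);
        // An unknown head is kept as-is, so it simply matches nothing.
        return ROMANIZED_READINGS.getOrDefault(head, head) + tail;
    }

    // Returns the surface forms whose romanized key starts with the normalized query.
    static List<String> suggest(String query) {
        String key = normalize(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : INDEX.entrySet()) {
            if (entry.getKey().startsWith(key)) {
                hits.add(entry.getValue());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(normalize("会sy") + " -> " + suggest("会sy"));
    }
}
```

Note that the index key here uses the Kunrei-style spelling "kaisya"; a query typed as "会sh" would additionally need the Hepburn-style key "kaisha", which is where the multiple-romanization problem from 1) comes back in.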
> +Solution+
> I implemented a token filter (and added an analyzer for ease of use) that
> handles those two challenges. With this filter, we can use
> AnalyzingSuggester for fast automaton-based auto-completion for Japanese.
> (Though I acknowledge it contains some peculiar logic, I suppose that is
> necessary complexity for a tool that deals with the intricacy of natural
> language systems...)
>
> +Note+
> * The filter has worked well for us for one year on a production system
> with a moderate number of business users (1000+), and I've fixed some weird
> bugs we've encountered so far. Also, the donation of the code has been
> approved by the managers.
> * There is one missing thing: offset correction. I found that correct offset
> calculation is not required for auto-completion use cases, but I'm trying to
> emit the correct offsets for completeness.