Tomoko Uchida created LUCENE-8816:
-------------------------------------

             Summary: Decouple Kuromoji's morphological analyser and its 
dictionary
                 Key: LUCENE-8816
                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/analysis
            Reporter: Tomoko Uchida


I was inspired by this mailing-list thread.
 
[http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]

As many Japanese users already know, the default built-in dictionary bundled 
with Kuromoji (MeCab IPADIC) is somewhat old and has not been maintained for 
many years. While it has slowly become obsolete, well-maintained and/or 
extended dictionaries have emerged in recent years (e.g. 
[mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
[UniDic|https://unidic.ninjal.ac.jp/]). Several attempts and projects to use 
them with Kuromoji have been made in Japan.

However, the current architecture (a jar with the dictionary bundled) is 
essentially incompatible with the idea of switching the system dictionary, and 
developers have difficulty doing so.

Traditionally, the morphological analysis engine (Viterbi logic) and the 
encoded dictionary (language model) have been decoupled (as in MeCab, the 
origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
and I feel that it is a good time to rethink the current architecture.

This would also benefit advanced users who have customized or re-trained their 
own system dictionary.

Goals of this issue:
 * Decouple JapaneseTokenizer from the encoded system dictionary.
 * Implement a dynamic dictionary loading mechanism.
 * Provide a developer-oriented dictionary build tool.
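
To make the first two goals a bit more concrete, here is a rough sketch of what 
decoupling plus dynamic loading could look like. This is only an illustration 
under my own assumptions: SystemDictionary, ToyIpadic, ToyTokenizer and 
DictionaryLoadingSketch are hypothetical names, not existing Lucene classes, 
and the costs are made-up placeholders. The only real APIs used are 
Class.forName and reflection from the JDK.

```java
import java.util.List;
import java.util.Map;

// Hypothetical contract between the engine and an encoded system dictionary.
// This is NOT a Lucene interface; it only illustrates the separation.
interface SystemDictionary {
    String name();

    // Word cost consulted by the engine; lower means more likely.
    int wordCost(String surface);
}

// Hypothetical dictionary that would ship in its own jar, apart from the engine.
// The entries and costs are placeholders, not real IPADIC data.
final class ToyIpadic implements SystemDictionary {
    private final Map<String, Integer> costs = Map.of("tokyo", 100, "to", 200);

    public ToyIpadic() {}

    @Override
    public String name() {
        return "toy-ipadic";
    }

    @Override
    public int wordCost(String surface) {
        // Unknown words get a large fixed penalty in this toy model.
        return costs.getOrDefault(surface, 10_000);
    }
}

// Hypothetical engine: it depends only on the interface, never on a concrete
// dictionary class, so the dictionary can be swapped without touching it.
final class ToyTokenizer {
    private final SystemDictionary dictionary;

    ToyTokenizer(SystemDictionary dictionary) {
        this.dictionary = dictionary;
    }

    String dictionaryName() {
        return dictionary.name();
    }

    // Sums the word costs along one candidate segmentation.
    int pathCost(List<String> surfaces) {
        int total = 0;
        for (String surface : surfaces) {
            total += dictionary.wordCost(surface);
        }
        return total;
    }
}

final class DictionaryLoadingSketch {
    // Loads a dictionary by class name at runtime via reflection.
    static SystemDictionary load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(SystemDictionary.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load dictionary: " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name could come from configuration; here it defaults to the toy one.
        String className = args.length > 0 ? args[0] : ToyIpadic.class.getName();
        ToyTokenizer tokenizer = new ToyTokenizer(load(className));
        System.out.println(tokenizer.dictionaryName() + ": cost="
                + tokenizer.pathCost(List.of("tokyo", "to")));
    }
}
```

With something along these lines, switching to another dictionary would mean 
putting a different jar on the classpath and passing a different class name, 
without rebuilding the tokenizer module. Whether reflection, a service loader, 
or some other mechanism fits best is an open question.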

Non-goals:

 - Provide a learner or language model (that is up to users and should be 
outside the scope of this issue).

I have not dived into the code yet, so at this moment I have no idea whether 
this will be easy or difficult.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
