[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483138#comment-17483138 ]
Tomoko Uchida commented on LUCENE-8816:
---------------------------------------
I'm sorry for leaving this for such a long time. We now have a stable Gradle
build infrastructure; I think we are ready to restart work on this.
I thought only about Kuromoji when I opened this issue, but my thinking has
changed somewhat since then. Before working on this issue, I wonder if we
should explore the possibility of unifying the dictionary builder/loader of
Kuromoji and Nori, so that both modules can benefit from the decoupling of data
and analysis engine at the same time. This would also significantly reduce code
duplication. The common or base dictionary builder could be placed in
analysis-common.
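As a rough illustration of what "decoupling data and analysis engine" could
look like, here is a minimal sketch of a loader that does not care where the
encoded dictionary lives. All type names, method names, and file names in it
(DictionaryResource, SharedDictionaryLoader, and so on) are hypothetical and
are not existing Lucene APIs; it only reads raw bytes and does not attempt to
decode a real Kuromoji or Nori dictionary.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical abstraction over where the encoded dictionary files come from. */
interface DictionaryResource {
  /** Opens the named dictionary file for reading. */
  InputStream open(String fileName) throws IOException;
}

/** Bundled-jar style: resolves files on the classpath relative to an anchor class. */
final class ClasspathDictionaryResource implements DictionaryResource {
  private final Class<?> anchor;

  ClasspathDictionaryResource(Class<?> anchor) {
    this.anchor = anchor;
  }

  @Override
  public InputStream open(String fileName) throws IOException {
    InputStream in = anchor.getResourceAsStream(fileName);
    if (in == null) {
      throw new FileNotFoundException("not found on classpath: " + fileName);
    }
    return in;
  }
}

/** External style: resolves files inside a user-supplied directory. */
final class DirectoryDictionaryResource implements DictionaryResource {
  private final Path dir;

  DirectoryDictionaryResource(Path dir) {
    this.dir = dir;
  }

  @Override
  public InputStream open(String fileName) throws IOException {
    return Files.newInputStream(dir.resolve(fileName));
  }
}

/** Stand-in for a shared loader; a real one would decode the binary format. */
final class SharedDictionaryLoader {
  private SharedDictionaryLoader() {}

  /** Reads the whole file; the caller never learns where the bytes came from. */
  static byte[] load(DictionaryResource resource, String fileName) throws IOException {
    try (InputStream in = resource.open(fileName)) {
      return in.readAllBytes();
    }
  }
}
```

With something along these lines, the tokenizer side would only depend on
DictionaryResource, and switching from the bundled dictionary to an external
one would amount to passing a DirectoryDictionaryResource instead of a
ClasspathDictionaryResource.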
As for the decoupling, JPMS (Java module system) support is also needed; we may
have to open up some packages in the dictionary module to the analysis-common
module.
I have to say I still don't have a fully detailed picture (and my progress will
be slow on such an extensive refactoring); I would welcome any feedback.
> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
>
> I was inspired by this mailing-list thread.
>
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for
> many years. While it has slowly become obsolete, well-maintained and/or
> extended dictionaries have emerged in recent years (e.g.
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts
> have been made in Japan to use them with Kuromoji.
> However, the current architecture (a jar with the dictionary bundled) is
> essentially incompatible with the idea of switching the system dictionary,
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the
> encoded dictionary (language model) have been decoupled (as in MeCab, the
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea,
> and I feel that it's a good time to rethink the current architecture.
> Also, this would be good for advanced users who have customized/re-trained
> their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer itself from the encoded system dictionary.
> * Implement a dynamic dictionary loading mechanism.
> * Provide a developer-oriented dictionary build tool.
> Non-goals:
> * Provide a learner or language model (that is up to users and should be
> outside the scope).
> I have not dived into the code yet, so at this moment I have no idea whether
> it's easy or difficult.