[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16851556#comment-16851556
 ] 

Tomoko Uchida commented on LUCENE-8816:
---------------------------------------

I looked over the diffs between the current Kuromoji and Nori code to see 
whether we can integrate them.

There are many overlaps (copied lines) but also a lot of differences that 
cannot be merged easily. (I haven't studied the details, but there may have 
been modifications specific to the Korean dictionary/tokenizer?)

I've heard that MeCab and MeCab IPADIC were designed to be language 
independent, but it seems things are not as easy as they look. :-)

I think the generalization/integration (if it's possible) should be handled in 
separate issues. I'd like to propose the following:
 - Keep JapaneseTokenizer and KoreanTokenizer as individual tokenizers (as is). 
Merging them does not seem feasible to me, but others might have good 
solutions or ideas for it.
 - First, decouple the encoded system dictionary (mecab-ipadic) from the 
kuromoji jar into a separate jar, and clean up the dictionary builder tool. 
This is the scope of this issue.
 - Then generalize the dictionary builder tool so that it can handle the 
Korean dictionary (mecab-ko-dic), in a separate issue.
 - Lastly, decouple the Korean system dictionary from the nori jar into a 
separate jar, maybe in another issue.
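
To make the "separate jar" step a bit more concrete, here is a rough sketch of 
how a tokenizer could look up a system dictionary that ships in its own jar. 
This is only an illustration: SystemDictionary, SystemDictionaryProvider and 
DictionaryLocator are hypothetical names, not existing Lucene classes, and the 
only real API it relies on is the standard java.util.ServiceLoader.

```java
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical view of an encoded system dictionary (not a Lucene interface).
interface SystemDictionary {
    // Returns true if the surface form is a known dictionary entry.
    boolean contains(String surface);
}

// Hypothetical SPI that a dictionary jar (e.g. one wrapping mecab-ipadic)
// would implement and register under META-INF/services.
interface SystemDictionaryProvider {
    String name();            // identifier chosen by the dictionary jar
    SystemDictionary load();  // decodes the dictionary resources
}

final class DictionaryLocator {
    private DictionaryLocator() {}

    // Picks the first provider whose name matches, if any.
    static Optional<SystemDictionaryProvider> find(
            String name, Iterable<SystemDictionaryProvider> providers) {
        for (SystemDictionaryProvider p : providers) {
            if (p.name().equals(name)) {
                return Optional.of(p);
            }
        }
        return Optional.empty();
    }

    // Discovers providers from whatever dictionary jars are on the classpath.
    static Optional<SystemDictionaryProvider> findOnClasspath(String name) {
        return find(name, ServiceLoader.load(SystemDictionaryProvider.class));
    }
}
```

With something like this, the tokenizer jar would carry only the analysis 
logic, and swapping the dictionary would mean putting a different dictionary 
jar on the classpath and asking for it by name.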

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>                 Key: LUCENE-8816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8816
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> I was inspired by this mailing list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, the default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for many 
> years. While it has slowly become obsolete, well-maintained and/or extended 
> dictionaries have emerged in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts 
> have been made in Japan to use them with Kuromoji.
> However, the current architecture (a jar with the dictionary bundled) is 
> essentially incompatible with the idea of switching the system dictionary, 
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the 
> encoded dictionary (language model) have been decoupled (as in MeCab, the 
> origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, 
> and I feel that it's a good time to rethink the current architecture.
> Also, this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself from the encoded system dictionary.
>  * Implement a dynamic dictionary loading mechanism.
>  * Provide a developer-oriented dictionary build tool.
> Non-goals:
>   * Provide a learner or language model (that is up to users and should be 
> outside the scope).
> I have not dived into the code yet, so I have no idea whether this is easy 
> or difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

