[
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849840#comment-16849840
]
Tomoko Uchida commented on LUCENE-8816:
---------------------------------------
Hi [~jim.ferenczi],
I just skimmed your comments and will try to answer your questions. Correct me
if I missed your points.
{quote}This could be a nice cleanup if the goal is to handle multiple mecab
dictionaries (in different languages).
{quote}
I hadn't thought of that, but of course it seems like a good unification to me.
{quote}While allowing more flexibility would be nice I wonder if there are that
many different dictionaries.
{quote}
In my view, there are only two dictionary formats we should support (MeCab
IPADIC and UniDic). There are some other old dictionaries - NAIST-jdic or
ChaSen ipadic - but they are completely obsolete now and rarely used (as far as
I know).
There are several well-known extensions of mecab-ipadic and unidic, so we can
support almost all common variants (in Japan) by supporting those.
{quote}If the ipadic is obsolete we could also adapt the main distribution
(kuromoji) to use the UniDic instead.
{quote}
Yes, I think so. But I am not sure that we should select UniDic as the default
immediately. While users often complain about MeCab IPADIC, it is still a
high-quality and widely accepted dictionary. And even when we change the
default dictionary to UniDic (I think that is definitely OK after giving users
some time to migrate/test their applications), we have to provide the
option to use the old MeCab IPADIC for users who trust it and do not need
"contemporary words".
{quote}Even if we handle multiple dictionaries we'll still need to provide a
way for users to add custom entries. Mecab has an option to compute the leftId,
rightId and cost automatically from a partial user entry so I wonder if this
could help to avoid users to reimplement a dictionary from scratch ?
{quote}
Yes, custom entries (user dictionaries) are a necessary option for customizing
tokenization behaviour, especially for users with little NLP skill :)
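
As a rough illustration of the decoupling discussed in this issue, the sketch
below shows a tokenizer-side lookup contract with a system dictionary and a
user dictionary layered on top of it. This is not Lucene code: the names
DictionarySketch, Entry, Dictionary, MapDictionary and OverlayDictionary are
all hypothetical, and the rule that user entries win over system entries, as
well as the ids and costs used, are assumptions made only for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DictionarySketch {

    /** Hypothetical entry: surface form plus the ids and cost a Viterbi search would use. */
    public static final class Entry {
        public final String surface;
        public final int leftId;
        public final int rightId;
        public final int wordCost;

        public Entry(String surface, int leftId, int rightId, int wordCost) {
            this.surface = surface;
            this.leftId = leftId;
            this.rightId = rightId;
            this.wordCost = wordCost;
        }
    }

    /** Hypothetical contract the analyser depends on instead of a bundled dictionary. */
    public interface Dictionary {
        Optional<Entry> lookup(String surface);
    }

    /** In-memory stand-in; a real implementation would load encoded dictionary files. */
    public static final class MapDictionary implements Dictionary {
        private final Map<String, Entry> entries = new HashMap<>();

        public MapDictionary add(Entry entry) {
            entries.put(entry.surface, entry);
            return this;
        }

        @Override
        public Optional<Entry> lookup(String surface) {
            return Optional.ofNullable(entries.get(surface));
        }
    }

    /** Consults the user dictionary first and falls back to the system dictionary. */
    public static final class OverlayDictionary implements Dictionary {
        private final Dictionary user;
        private final Dictionary system;

        public OverlayDictionary(Dictionary user, Dictionary system) {
            this.user = user;
            this.system = system;
        }

        @Override
        public Optional<Entry> lookup(String surface) {
            Optional<Entry> hit = user.lookup(surface);
            return hit.isPresent() ? hit : system.lookup(surface);
        }
    }

    public static void main(String[] args) {
        // Made-up ids and costs, for illustration only.
        Dictionary system = new MapDictionary().add(new Entry("東京", 10, 10, 3000));
        Dictionary user = new MapDictionary().add(new Entry("東京", 20, 20, 100));
        Dictionary combined = new OverlayDictionary(user, system);
        System.out.println(combined.lookup("東京").get().wordCost);
    }
}
```

With a contract like this, switching the system dictionary (IPADIC, UniDic,
or a custom build) would only mean passing a different Dictionary instance
to the analyser.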
> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
>
> I was inspired by this mailing-list thread.
>
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled
> with Kuromoji (MeCab IPADIC) is a bit old and has not been maintained for
> many years. While it has slowly become obsolete, well-maintained and/or
> extended dictionaries have emerged in recent years (e.g.
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd],
> [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts
> have been made in Japan to use them with Kuromoji.
> However, the current architecture - a jar with the dictionary bundled - is
> essentially incompatible with the idea of "switching the system dictionary",
> and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (Viterbi logic) and the
> encoded dictionary (language model) have been decoupled (as in MeCab, the
> origin of Kuromoji, or lucene-gosen). So decoupling them is actually a
> natural idea, and I feel that it's a good time to re-think the current
> architecture.
> Also this would be good for advanced users who have customized/re-trained
> their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer itself from the encoded system dictionary.
> * Implement a dynamic dictionary loading mechanism.
> * Provide a developer-oriented dictionary build tool.
> Non-goals:
> * Provide a learner or language model (that is up to users and should be
> outside the scope).
> I have not dived into the code yet, so I have no idea whether this is easy
> or difficult at this moment.