Re: JapaneseAnalyzer's system vs user dict

Tomoko Uchida Mon, 27 May 2019 15:58:08 -0700

Hi Namgyu,

> There is a team that uses a well-ported system dictionary.
> The Lucene version is up. (like 8.1 -> 8.2)
> Suppose there was no modification to kuromoji in 8.2.
> But the user has to port again.
> The same goes for 8.2 to 8.3.


I'm not sure about the situation in Korea, but we also have some
frequently updated, well-maintained (by NLP professionals) system
dictionaries:
1. neologd (a MeCab IPADIC extension) and 2. Sudachi (a UniDic extension
that partially includes neologd), both of which I mentioned in my
previous mail.
I agree that it's laborious to rebuild the tokenizer every time you
upgrade.

In both cases, some outstanding contributors build and distribute
plugins that include an up-to-date dictionary at a steady pace, and
other users just use them. This seems to work well, at least in Japan,
for now.
Maybe we can start outside the Lucene project in the same way? If
the workflow works well and it's really needed, developers can propose
the change (a patch for the build script, and possibly a policy for
operating or updating the system dictionary) in Jira at any time.

I know that the current JapaneseAnalyzer system dictionary (MeCab
IPADIC) has not been maintained for ten years, and developers/users
often complain about it.
For now, I just see the developer community (including me) making an
effort to find good solutions for that.

Thanks,
Tomoko

On Tue, May 28, 2019 at 2:42 AM Namgyu Kim <[email protected]> wrote:
>
> Thank you for your reply, Tomoko :D
>
> To be honest, I have no direct (commercial) experience with it, so I
> can't speak to the exact situation of the Japanese MeCab.
> I respect your opinion, and it is true that customization is a difficult
> task.
>
> But I can talk a little bit about Korean MeCab. (The basic logic is the
> same)
> In the case of Hangul MeCab, system dictionary changes are very frequent.
> Developers do not design the engine from scratch, so they tend to do a
> lot of tuning at some level (such as a custom model, score matrix, or
> custom dictionary).
> Especially in commercial use, developers do a lot of tuning to make
> the dictionary that is most suitable for their purpose.
> (Of course, the big tech companies use their own analyzers :D)
>
> MeCab is especially popular in Korea, so there are many such attempts.
> Developers often port it to Elasticsearch and use it a lot, but they have
> to do a lot of tedious work every time.
> (This is not the Korean MeCab case, but I think Mike and Trejkaz were
> talking in that sense.)
>
> There is another bad case.
>
> There is a team that uses a well-ported system dictionary.
> The Lucene version is up. (like 8.1 -> 8.2)
> Suppose there was no modification to kuromoji in 8.2.
> But the user has to port again.
> The same goes for 8.2 to 8.3.
> Even if kuromoji has a fix that is not related to the dictionary, the user
> has to port each time.
>
> At least if we allow them to read custom dat files, these problems would
> disappear.
>
> Warm regards,
> Namgyu Kim
>
> On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida <[email protected]>
> wrote:
>
> > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > the system dictionary status is good or not.
> >
> > Please don't get me wrong, but I don't think so.
> > Creating a customized or re-trained system dictionary still requires deep
> > knowledge of the language and of machine learning. Even among us
> > native Japanese speakers, very few people can do so.
> > The system dictionary is a key component of tokenization, so a badly
> > customized system dictionary directly affects search quality,
> > and I think we should prevent that. Instead of messing up the system
> > dictionary without sufficient knowledge, please use the user
> > dictionary. That is the reason why it exists.
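
For reference, a minimal sketch of what a user dictionary entry looks
like, assuming the four-column layout (surface form, space-separated
segmentation, space-separated readings, part-of-speech tag) used in
Lucene's sample userdict_ja.txt. The helper name user_dict_line is
hypothetical, and Python is used only for illustration; this is not
Lucene code.

```python
def user_dict_line(surface, segments, readings, pos):
    """Format one entry as: surface,segmentation,readings,part-of-speech.

    Assumed layout; segments and readings are joined with single spaces.
    """
    # Each segment is paired with one reading, so the counts must match.
    if len(segments) != len(readings):
        raise ValueError("each segment needs exactly one reading")
    return ",".join([surface, " ".join(segments), " ".join(readings), pos])


# Split a compound noun into three tokens with a custom POS tag.
line = user_dict_line(
    "関西国際空港",
    ["関西", "国際", "空港"],
    ["カンサイ", "コクサイ", "クウコウ"],
    "カスタム名詞",
)
print(line)
```

An entry like this only adds or re-segments individual words; it leaves
the trained costs of the system dictionary untouched.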
> >
> > Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> > do not need to read or fix the DictionaryBuilder class.
> > Just modify analysis/kuromoji/build.xml to use the
> > customized/re-trained dictionary (tarball).
> >
> > Tomoko
> >
> > On Mon, May 27, 2019 at 1:48 AM Namgyu Kim <[email protected]> wrote:
> > >
> > > Oh, I think my explanation was not enough. Sorry...
> > >
> > > I mentioned the following sentences.
> > > =============================
> > > 1. Modify your dictionary file and rebuild.
> > >   1-1) Install MeCab
> > >   1-2) Install MeCab Dictionary
> > >   1-3) Modify your dictionary file
> > >   1-4) Make it to tar.gz
> > > =============================
> > > The "1-3)" does not mean that the user modifies the CSV files and
> > > compresses them back to tar.gz.
> > > It means re-training; of course the user has to be careful and have
> > > knowledge of Natural Language Processing.
> > > Columns 2, 3, and 4 of the CSV are values produced by training.
> > > (2: left context id, 3: right context id, 4: cost)
> > > These values depend on the model and the matrix.def values (when using
> > > mecab-dict-index).
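
The column layout described above can be sketched as follows. This is an
illustrative sketch, not part of MeCab or Lucene: parse_entry and
DictEntry are hypothetical names, and the sample row uses made-up ID and
cost values rather than values from a real dictionary.

```python
import csv
from typing import List, NamedTuple


class DictEntry(NamedTuple):
    surface: str         # column 1: the word as written
    left_id: int         # column 2: left context id (from training)
    right_id: int        # column 3: right context id (from training)
    cost: int            # column 4: word cost (from training)
    features: List[str]  # remaining columns: POS and other features


def parse_entry(line: str) -> DictEntry:
    """Split one MeCab-style CSV row into trained values and features."""
    row = next(csv.reader([line]))
    return DictEntry(row[0], int(row[1]), int(row[2]), int(row[3]), row[4:])


# Made-up row for illustration only.
sample = "東京,1293,1293,3003,名詞,固有名詞,地域,一般"
entry = parse_entry(sample)
print(entry.left_id, entry.right_id, entry.cost)
```

Because the three numeric columns are tied to the trained model and to
matrix.def, editing them by hand without re-training leaves the row
inconsistent with the rest of the dictionary.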
> > >
> > > That's why I mentioned "1-1)" and "1-2)" processes first.
> > >
> > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > the system dictionary status is good or not.
> > > I just think that when a user wants to use a custom system dictionary,
> > > it is not user-friendly to have to modify the ant file or search the
> > > code for a long time to run the DictionaryBuilder.
> > > I think there should be at least a guide.
> > >
> > > Warm regards,
> > > Namgyu Kim
> > >
> > > P.S. Although not as good as Tomoko's list, there is a list of
> > > dictionaries supported by kuromoji.
> > > (https://github.com/atilika/kuromoji#supported-dictionaries)
> > >
> > >
> > > On Mon, May 27, 2019 at 12:12 AM Tomoko Uchida <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > The system dictionary is not a mere "word collection"; it includes a
> > > > machine-learned language model that was carefully trained by
> > > > researchers. If you want to replace the system dictionary, you have to
> > > > start by re-training the model. This requires expert knowledge, so I
> > > > do not recommend just modifying the CSVs and rebuilding (unless you
> > > > have an expert on it).
> > > >
> > > > As for "modern words" that are not included in the current
> > > > system dictionary, there are already a few options.
> > > >
> > > > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > > > Kuromoji's default dictionary)
> > > >
> > > > For Solr:
> > > > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > > > (The branch is mine. A little bit old, but you can cherry-pick the
> > > > changes in the kuromoji's build.xml.)
> > > >
> > > > For Elasticsearch:
> > > >
> > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> > > >
> > > > 2. Use Sudachi dictionary
> > > >
> > > > For Elasticsearch:
> > > > https://github.com/WorksApplications/elasticsearch-sudachi
> > > > This includes a Lucene jar, so I think you can extract the jar for Solr
> > > > (I've never tried using it with Solr).
> > > >
> > > > Both are actively maintained by linguistics & NLP researchers/engineers.
> > > > Please be careful, those are rather huge jars...
> > > >
> > > > Hope that helps.
> > > >
> > > > Tomoko
> > > >
> > > > On Sun, May 26, 2019 at 11:11 PM Trejkaz <[email protected]> wrote:
> > > > >
> > > > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <[email protected]> wrote:
> > > > >
> > > > > > I think so about that approach.
> > > > > > It's not user-friendly and it is not good for the user.
> > > > > >
> > > > > > I think it's better to get the parameters in JapaneseTokenizer.
> > > > > >
> > > > > > What do you think about this?
> > > > >
> > > > >
> > > > > A way to override the system dictionary would be useful for us as
> > > > > well. We often get people complaining that the current dictionary is
> > > > > missing a lot of common modern words, and there are alternate mecab
> > > > > dictionaries sitting around already which solve this problem.
> > > > >
> > > > > TX
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >