Hi Tomoko :D

Thank you for your reply and for listening to my thoughts.
I didn't know this was an old question.
Of course, I would like to participate in the LUCENE-8816 issue.
I think this issue will take some time. I'll check it.

Warm regards,
Namgyu Kim

On Tue, May 28, 2019 at 10:43 PM Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:

> Hi guys,
>
> I just created an issue related to this thread.
>
> Decouple Kuromoji's morphological analyser and its dictionary
> https://issues.apache.org/jira/browse/LUCENE-8816
>
> The problem discussed here is essentially within the current
> architecture of Kuromoji (and Nori), "jar bundled system dictionary".
> So, the most natural solution is decoupling the Viterbi logic and the
> encoded dictionary (just as traditional Japanese morphological
> analysis engines do).
> This is actually an old question with respect to kuromoji, however I feel
> that it's a good time to re-think it.
>
> It will take time (and to be honest I'm not sure the patch will be
> accepted) but I think it's much better than applying monkey-fixes to
> the current build script.
> If you are seriously interested in this work, please feel free to get
> involved in it.
>
> Tomoko
>
> On Tue, May 28, 2019 at 7:57 Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> >
> > Hi Namgyu,
> >
> > > There is a team that uses a well-ported system dictionary.
> > > The Lucene version is up. (like 8.1 -> 8.2)
> > > Suppose there was no modification to kuromoji in 8.2.
> > > But the user has to port again.
> > > The same goes for 8.2 to 8.3.
> >
> > I'm not sure about the situation in Korea, however, we also have some
> > frequently updated, well-maintained (by NLP professionals) system
> > dictionaries:
> > 1. neologd (mecab ipadic extension) and 2. Sudachi (unidic extension
> > partially including neologd), which I mentioned in my previous mail.
> > I agree that it's laborious to re-build the tokenizer every time
> > when upgrading.
> >
> > In both cases, some outstanding contributors build and distribute
> > plugins including an up-to-date dictionary at a constant pace, and other
> > users just use them. It seems this works well, at least in Japan, for
> > now.
> > Maybe we can start from outside of the Lucene project like that? If
> > the workflow works well and it's really needed, developers can propose
> > the change (a patch for the build script, and possibly the system
> > dictionary operation or update policy is also needed) to the Jira
> > anytime.
> >
> > I know that the current JapaneseAnalyzer's system dictionary (MeCab
> > IPADIC) has not been maintained for ten years and developers/users
> > often complain about it.
> > For now I just see the effort of the developers community (including
> > me) to try to find good solutions for that.
> >
> > Thanks,
> > Tomoko
> >
> > On Tue, May 28, 2019 at 2:42 Namgyu Kim <kng0...@gmail.com> wrote:
> > >
> > > Thank you for your reply, Tomoko :D
> > >
> > > To be honest, I have not experienced it directly (means commercialize), so I
> > > can't tell the exact situation of the Japanese MeCab.
> > > I respect your opinion and it is true that customization is a difficult
> > > task.
> > >
> > > But I can talk a little bit about Korean MeCab. (The basic logic is the
> > > same)
> > > In the case of Hangul MeCab, system dictionary changes are very frequent.
> > > Developers do not design the engine from the bottom, so they tend to try a
> > > lot of tuning at some level. (like custom model, score matrix, custom
> > > dictionary)
> > > Especially in commercialization, developers do a lot of tuning to make
> > > the dictionary that is the most suitable for the purpose.
> > > (Of course, the big tech companies use their own analyzers :D)
> > >
> > > MeCab is especially popular in Korea, so there are many attempts.
> > > Developers often port it to Elasticsearch and use it a lot, but they have to
> > > do a lot of boring work every time.
> > > (It is not the Korean MeCab case, but I think Mike and Trejkaz talked in that
> > > sense)
> > >
> > > There is another bad case.
> > >
> > > There is a team that uses a well-ported system dictionary.
> > > The Lucene version is up.
> > > (like 8.1 -> 8.2)
> > > Suppose there was no modification to kuromoji in 8.2.
> > > But the user has to port again.
> > > The same goes for 8.2 to 8.3.
> > > Even if kuromoji has a fix that is not associated with Dictionary, the user
> > > has to port each time.
> > >
> > > At least if we allow them to read custom dat files, these problems can
> > > disappear.
> > >
> > > Warm regards,
> > > Namgyu Kim
> > >
> > > On Mon, May 27, 2019 at 8:21 AM Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > > >
> > > > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > > > the system dictionary status is good or not.
> > > >
> > > > Please don't get me wrong, but I don't think so.
> > > > Creating a customized or re-trained system dictionary still needs deep
> > > > knowledge about language and machine-learning. Even among us,
> > > > native Japanese, very few people can do so.
> > > > The system dictionary is a key component for tokenization, so a badly
> > > > customized system dictionary directly affects the search quality
> > > > and I think we should prevent it. Instead of messing up the system
> > > > dictionary without sufficient knowledge, please use the user
> > > > dictionary. That is the reason why it exists.
> > > >
> > > > Anyway, to build the system dictionary (MeCab IPADIC extensions), you
> > > > do not need to read or fix the DictionaryBuilder class.
> > > > Just modify analysis/kuromoji/build.xml to use the
> > > > customized/re-trained dictionary (tar ball).
> > > >
> > > > Tomoko
> > > >
> > > > On Mon, May 27, 2019 at 1:48 Namgyu Kim <kng0...@gmail.com> wrote:
> > > > >
> > > > > Oh, I think my explanation was not enough. Sorry...
> > > > >
> > > > > I mentioned the following sentences.
> > > > > =============================
> > > > > 1. Modify your dictionary file and rebuild.
> > > > > 1-1) Install MeCab
> > > > > 1-2) Install MeCab Dictionary
> > > > > 1-3) Modify your dictionary file
> > > > > 1-4) Make it to tar.gz
> > > > > =============================
> > > > > The "1-3)" does not mean the user modifies the csv files and compresses them back
> > > > > to tar.gz.
> > > > > It means re-training; of course the user has to be careful and have knowledge
> > > > > of Natural Language Processing.
> > > > > Columns 2, 3 and 4 in the csv are the values produced by training.
> > > > > (2 : left context id, 3 : right context id, 4 : cost)
> > > > > These values are dependent on the model and matrix.def values. (when using
> > > > > mecab-dict-index)
> > > > >
> > > > > That's why I mentioned the "1-1)" and "1-2)" processes first.
> > > > >
> > > > > Anyway, in my personal opinion, Lucene does not need to consider whether
> > > > > the system dictionary status is good or not.
> > > > > I just think when some user wants to use a custom system dictionary, it is
> > > > > not user-friendly to modify the ant file or find some code for a long time
> > > > > to run the DictionaryBuilder.
> > > > > I think there should be at least a guide.
> > > > >
> > > > > Warm regards,
> > > > > Namgyu Kim
> > > > >
> > > > > P.S. Although not as good as Tomoko's contents, there is a list of
> > > > > dictionaries supported by kuromoji.
> > > > > (https://github.com/atilika/kuromoji#supported-dictionaries)
> > > > >
> > > > > On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > The system dictionary is not a mere "word collection", it includes a
> > > > > > machine-learned language model which is carefully trained by
> > > > > > researchers. If you want to replace the system dictionary, you have to
> > > > > > start from "re-training" the model.
> > > > > > This needs expert knowledge so I do
> > > > > > not recommend just modifying the CSVs and rebuilding it (if you do not
> > > > > > have an expert about it).
> > > > > >
> > > > > > As far as it relates to "modern words" which are not included in the current
> > > > > > system dictionary, there are already a few options.
> > > > > >
> > > > > > 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> > > > > > Kuromoji's default dictionary)
> > > > > >
> > > > > > For Solr:
> > > > > > https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> > > > > > (The branch is mine. A little bit old, but you can cherry-pick the
> > > > > > changes in the kuromoji's build.xml.)
> > > > > >
> > > > > > For Elasticsearch:
> > > > > > https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
> > > > > >
> > > > > > 2. Use Sudachi dictionary
> > > > > >
> > > > > > For Elasticsearch:
> > > > > > https://github.com/WorksApplications/elasticsearch-sudachi
> > > > > > This includes the Lucene jar, so I think you can extract the jar for Solr
> > > > > > (I've never tried to use it with Solr).
> > > > > >
> > > > > > Both are actively maintained by linguistics & NLP researchers/engineers.
> > > > > > Please be careful, those are rather huge jars...
> > > > > >
> > > > > > Hope that helps.
> > > > > >
> > > > > > Tomoko
> > > > > >
> > > > > > On Sun, May 26, 2019 at 23:11 Trejkaz <trej...@trypticon.org> wrote:
> > > > > > >
> > > > > > > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0...@gmail.com> wrote:
> > > > > > >
> > > > > > > > I think so about that approach.
> > > > > > > > It's not user-friendly and it is not good for the user.
> > > > > > > > I think it's better to get the parameters in JapaneseTokenizer.
> > > > > > > > What do you think about this?
> > > > > > >
> > > > > > > A way to override the system dictionary would be useful for us as well.
> > > > > > > We often get people complaining that the current dictionary is missing a lot
> > > > > > > of common modern words, and there are alternate mecab dictionaries sitting
> > > > > > > around already which solve this problem.
> > > > > > >
> > > > > > > TX
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org