Oh, I think my explanation was not sufficient. Sorry...

Earlier I wrote the following:

=============================
1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install MeCab Dictionary
  1-3) Modify your dictionary file
  1-4) Make it to tar.gz
=============================

Step "1-3)" does not mean that the user edits the CSV files and compresses them back into a tar.gz.
It means re-training, and of course the user has to be careful and needs some knowledge of Natural Language Processing.

Columns 2, 3 and 4 of each CSV row are values produced by training (2: left context id, 3: right context id, 4: cost).
These values depend on the model and on the matrix.def values (when using mecab-dict-index).
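To illustrate why those three columns cannot be edited in isolation, here is a rough sketch in Python. It is not MeCab's or Kuromoji's actual code: the helper names are hypothetical, the rows and matrix values are made up, BOS/EOS handling is left out, and I am assuming the usual matrix.def layout (a size header followed by "right-id left-id cost" rows). It only shows that the score of a segmentation mixes the per-word cost with the connection cost looked up by the context ids.

```python
import csv
import io


def parse_entry(line):
    """Split one MeCab-style CSV row into surface form, context ids and cost."""
    row = next(csv.reader(io.StringIO(line)))
    return {
        "surface": row[0],
        "left_id": int(row[1]),   # column 2: left context id
        "right_id": int(row[2]),  # column 3: right context id
        "cost": int(row[3]),      # column 4: word cost
        "features": row[4:],
    }


def parse_matrix(text):
    """Read matrix.def-style text into {(prev_right_id, next_left_id): cost}."""
    lines = text.strip().splitlines()
    costs = {}
    for line in lines[1:]:  # the first line holds the matrix dimensions
        prev_right, next_left, cost = map(int, line.split())
        costs[(prev_right, next_left)] = cost
    return costs


def path_cost(entries, matrix):
    """Sum the word costs plus the connection cost of each adjacent pair."""
    total = sum(e["cost"] for e in entries)
    for prev, nxt in zip(entries, entries[1:]):
        total += matrix[(prev["right_id"], nxt["left_id"])]
    return total


# Made-up rows and matrix, for illustration only (not real IPADIC values).
ENTRIES = [
    parse_entry("東京,1,1,3000,名詞,固有名詞"),
    parse_entry("都,2,2,5000,名詞,接尾"),
]
MATRIX = parse_matrix("3 3\n1 2 -200\n2 1 400\n")

print(path_cost(ENTRIES, MATRIX))
```

If someone hand-edits an id or a cost in the CSV, the lookup into the matrix changes as well, so the new row competes against the trained rows with numbers that were never fitted together.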
That's why I listed steps "1-1)" and "1-2)" first.

Anyway, in my personal opinion, Lucene does not need to care whether a system dictionary is good or not.
I just think that when a user wants to use a custom system dictionary, it is not user-friendly to have to modify the ant file or dig through the code for a long time to find out how to run the DictionaryBuilder.
I think there should be at least a guide.

Warm regards,
Namgyu Kim

P.S. Although it is not as good as Tomoko's list, there is a list of dictionaries supported by kuromoji.
(https://github.com/atilika/kuromoji#supported-dictionaries)

On Mon, May 27, 2019 at 12:12 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:

> Hi,
>
> The system dictionary is not a mere "word collection", it includes a
> machine-learned language model which is carefully trained by
> researchers. If you want to replace the system dictionary, you have to
> start from "re-train" the model. This needs expert knowledge so I do
> not recommend to just modify the CSVs and rebuild it (if you do not
> have an expert about it).
>
> As far as relates to "modern words" which is not included the current
> system dictionary, there are already a few options.
>
> 1. Use neologd dictionary (it's an extension of MeCab IPADIC,
> Kuromoji's default dictionary)
>
> For Solr:
> https://github.com/mocobeta/lucene-solr/tree/kuromoji-neologd_5_4_0
> (The branch is mine. A little bit old, but you can cherry-pick the
> changes in the kuromoji's build.xml.)
>
> For Elasticsearch:
> https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
>
> 2. Use Sudachi dictionary
>
> For Elasticsearch:
> https://github.com/WorksApplications/elasticsearch-sudachi
> This includes Lucene jar, so I think you can extract the jar for Solr
> (I've never tried to use with Solr).
>
> Both are actively maintained by linguistics & NLP researchers/engineers.
> Please be careful, those are rather huge jars...
>
> Hope that helps.
>
> Tomoko
>
> On Sun, May 26, 2019 at 23:11, Trejkaz <trej...@trypticon.org> wrote:
> >
> > On Sun, 26 May 2019 at 23:49, Namgyu Kim <kng0...@gmail.com> wrote:
> >
> > > I think so about that approach.
> > > It's not user-friendly and it is not good for the user.
> > > > I think it's better to get the parameters in
> > > > JapaneseTokenizer.
> > >
> > > What do you think about this?
> >
> > A way to override the system dictionary would be useful for us as well. We
> > often get people complaining that the current dictionary is missing a lot
> > of common modern words, and there are alternate mecab dictionaries sitting
> > around already which solve this problem.
> >
> > TX
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org