Sorry for the wrong information, Mike. Tomoko is right; I misread the code.
User dictionary is independent from the system dictionary. If you give
the user entries, JapaneseTokenizer builds two FSTs, one for the built-in
dictionary and one for the user dictionary, and they are retrieved separately.

Please ignore the following lines in my e-mail.
================================================
Japanese Analyzer does not load dictionaries by default.
...
Since it is a way to create and pass the UserDictionary object, there is no
conflict between user dictionary and system dictionary.
(You may choose only one of them! -> means userFST instance in JapaneseTokenizer)
=================================================

The system dictionary and the user dictionary are separate, and you can use both at the same time.

About the system dictionary:
As far as I know, it is not possible to change the system dictionary at the code level.
The part that reads the system dictionary is hard-coded.
(TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
If you really need it, could you open a JIRA issue so we can work on it together?

However, there is a way to build a new kuromoji jar.

1. Modify your dictionary file and rebuild.
  1-1) Install MeCab
  1-2) Install the MeCab dictionary
  1-3) Modify your dictionary file
  1-4) Package it as a tar.gz
2. Change kuromoji/ivy.xml
from
<artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
to
<artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
4. Run "ant jar"

I hope this helps.

Warm regards,
Namgyu Kim

On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:

> Thank you for the detailed responses! What Tomoko is saying seems
> consistent with my cursory reading of the code. The reason I asked is
> I have a customer that thinks they want to replace the system
> dictionary, and I am trying to see if that is necessary.
> It seems as
> if for the most part, we can supply a comprehensive user dictionary
> and it would pretty much take the place of the system dictionary,
> assuming it is a superset (covers at least the original system dict
> tokens), but there is probably no way to "remove" a token that is
> present in the system dictionary (or maybe it can effectively be
> removed by adding it to user dictionary with a high penalty?). I'm not
> sure why one would want to do this removal, just trying to understand
> the design parameters.
>
> On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> <tomoko.uchida.1...@gmail.com> wrote:
> >
> > Hi,
> >
> > > If I provide entries in the user
> > > dictionary is it just as if I had included them in the system
> > > dictionary? If the same entry occurs in both, do the user dictionary
> > > weights supersede those in the system dictionary? Is there some way to
> > > suppress entries in the system dict?
> >
> > User dictionary is independent from the system dictionary. If you give
> > the user entries, JapaneseTokenizer builds two FSTs, one for the
> > built-in dictionary and one for the user dictionary, and they are
> > retrieved separately.
> >
> > First the user dictionary is retrieved, and if there are no entries
> > matched then the system dictionary is retrieved. So if any entry is
> > found in the user dictionary, all possible candidates in the system
> > dictionary are ignored (suppressed).
> >
> > (I think this is kuromoji-specific behaviour; the original MeCab POS
> > tagger retrieves both the system dictionary and the user dictionary and
> > compares their weights by performing Viterbi. In fact the behaviour,
> > always giving priority to the entries in the user dictionary, is a bit
> > too aggressive from the point of view of the consistency of
> > tokenization. I do not know why, but there may be some performance
> > reasons?)
> >
> > I think you can easily find the retrieval logic I described here in
> > JapaneseTokenizer#parse() method.
> > (Let me know if my understanding is
> > not correct.)
> >
> > Regards,
> > Tomoko
> >
> > On Sun, May 26, 2019 at 5:08, 김남규 <kng0...@gmail.com> wrote:
> > >
> > > Hi, Mike :D
> > >
> > > Japanese Analyzer does not load dictionaries by default.
> > > If you look at the constructor, you can see that it is created as null if
> > > no parameters are set.
> > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > >
> > > In JapaneseTokenizer,
> > > =============================================
> > > if (userDictionary != null) {
> > >   userFST = userDictionary.getFST();
> > >   userFSTReader = userFST.getBytesReader();
> > > } else {
> > >   userFST = null;
> > >   userFSTReader = null;
> > > }
> > > =============================================
> > > Since it is a way to create and pass the UserDictionary object, there is no
> > > conflict between user dictionary and system dictionary.
> > > (You may choose only one of them! -> means userFST instance in
> > > JapaneseTokenizer)
> > >
> > > About dictionary,
> > > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > It is based on MeCab, which uses the Viterbi algorithm.
> > > Lucene converts the MeCab dictionary (in Lucene, some dat files) to an FST and
> > > uses it.
> > > But it can't satisfy all users.
> > > Depending on the situation, some users may need a custom dictionary.
> > > The same applies to Nori (the Korean analyzer) since Lucene 7.4. (The basic
> > > logic (MeCab + FST) is similar to the Japanese analyzer.)
> > > The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> > > dictionary size is 24MB.
> > > If the user needs a dictionary of 100MB size, the user must build and use
> > > it.
> > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > >
> > > If anyone finds wrong information in my reply, please send a reply with
> > > the correction.
> > >
> > > Thank you,
> > > Namgyu Kim
> > >
> > >
> > > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> > >
> > > > I'm trying to understand the relationship between the system and user
> > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > provide a user dictionary; the system one is built in. Are they
> > > > otherwise the same kind of thing? If I provide entries in the user
> > > > dictionary is it just as if I had included them in the system
> > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > weights supersede those in the system dictionary? Is there some way to
> > > > suppress entries in the system dict? I hunted for documentation, but
> > > > didn't find answers to these questions, and the code is pretty
> > > > involved, so any pointers would be greatly appreciated.
> > > >
> > > > -Mike
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
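
For readers of this thread, here is a minimal sketch of the lookup order Tomoko describes: the user dictionary is consulted first, and the system dictionary is consulted only when the user dictionary has no match at that position. This is a toy model, not Lucene code. The class and method names (DictionaryLookupSketch, prefixMatches, candidatesAt) are hypothetical, plain sets stand in for the two FSTs, the romanized strings are placeholders for Japanese surface forms, and costs and Viterbi are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: plain sets stand in for the two FSTs that
// JapaneseTokenizer builds. None of these names are Lucene classes.
class DictionaryLookupSketch {

    // Collects every dictionary entry that is a prefix of text starting at pos,
    // shortest match first.
    static List<String> prefixMatches(Set<String> dict, String text, int pos) {
        List<String> matches = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    // User dictionary first; the system dictionary is consulted only
    // when the user dictionary has no match at this position.
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> userMatches = prefixMatches(userDict, text, pos);
        if (!userMatches.isEmpty()) {
            return userMatches; // system candidates at this position are suppressed
        }
        return prefixMatches(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> userDict = Set.of("kansai");
        Set<String> systemDict = Set.of("kan", "kansai", "kansaikokusai", "kuukou");
        String text = "kansaikokusaikuukou";
        // Position 0: the user entry wins, "kan" and "kansaikokusai" are ignored.
        System.out.println(candidatesAt(userDict, systemDict, text, 0));  // [kansai]
        // Position 13: no user entry matches, so the system dictionary is used.
        System.out.println(candidatesAt(userDict, systemDict, text, 13)); // [kuukou]
    }
}
```

In this model a user entry cannot delete a system entry outright; it can only shadow the system candidates at positions where the user dictionary matches, which is consistent with Mike's reading that there is probably no direct way to "remove" a system token.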