Thanks, Namgyu. I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?) and I can replace the existing one on the classpath with jar surgery for now. Not a very user-friendly approach, but it will enable me to run some experiments and see whether this is truly necessary for my use case.
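As a rough mental model for those experiments, the lookup order Tomoko describes in the quoted thread below (user dictionary first, system dictionary only when the user dictionary has no match at that position) can be sketched in plain Java. This is not Lucene code: the class and method names (LookupOrderSketch, prefixMatches, candidatesAt) are hypothetical, plain sets stand in for the two FSTs, and costs and the Viterbi search are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, no Lucene classes involved.
class LookupOrderSketch {

    // Stand-in for an FST lookup: every dictionary entry that is a
    // prefix of text starting at position pos, shortest first.
    static List<String> prefixMatches(Set<String> dict, String text, int pos) {
        List<String> out = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                out.add(candidate);
            }
        }
        return out;
    }

    // User dictionary is consulted first. If it yields any match at this
    // position, the system dictionary is not consulted at all.
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> fromUser = prefixMatches(userDict, text, pos);
        if (!fromUser.isEmpty()) {
            return fromUser;
        }
        return prefixMatches(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> user = new HashSet<>(Arrays.asList("tokyotower"));
        Set<String> system = new HashSet<>(Arrays.asList("tokyo", "tower"));

        // Position 0: the user entry matches, so the system entry "tokyo" is suppressed.
        System.out.println(candidatesAt(user, system, "tokyotower", 0)); // [tokyotower]

        // Position 5: no user entry starts here, so the system dictionary is used.
        System.out.println(candidatesAt(user, system, "tokyotower", 5)); // [tower]
    }
}
```

If this model is right, a user dictionary entry cannot delete a system entry outright; it can only shadow the system candidates that start at the same position.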

On Sun, May 26, 2019 at 7:56 AM Namgyu Kim <kng0...@gmail.com> wrote:
>
> Sorry for the wrong information, Mike.
> Tomoko is right.
> I checked it wrong.
>
> User dictionary is independent from the system dictionary. If you give
> the user entries, JapaneseTokenizer builds two FSTs one for the
> built-in dictionary and one for the user dictionary and they are
> retrieved separately.
>
> Please ignore the following lines in my e-mail.
> ================================================
> Japanese Analyzer does not load dictionaries by default.
> ...
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
> =================================================
>
> The System dictionary and the User dictionary are separated and can have
> each.
>
> About System dictionary,
> As I know, it is not possible to change the System dictionary at the code
> level.
> The part that reads the System dictionary is hard-coded.
> (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> If you really need it, can you make a JIRA issue and proceed with me?
>
> But there is a way to build a new kuromoji jar.
> 1. Modify your dictionary file and rebuild.
> 1-1) Install MeCab
> 1-2) Install MeCab Dictionary
> 1-3) Modify your dictionary file
> 1-4) Make it to tar.gz
> 2. change kuromoji/ivy.xml from
> <artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
> to
> <artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
> 3. "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> 4. "ant jar"
>
> I wish I could help you.
>
> Warm regards,
> Namgyu Kim
>
> On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
>
> > Thank you for the detailed responses! What Tomoko is saying seems
> > consistent with my cursory reading of the code. The reason I asked is
> > I have a customer that thinks they want to replace the system
> > dictionary, and I am trying to see if that is necessary. It seems as
> > if for the most part, we can supply a comprehensive user dictionary
> > and it would pretty much take the place of the system dictionary,
> > assuming it is a superset (covers at least the original system dict
> > tokens), but there is probably no way to "remove" a token that is
> > present in the system dictionary (or maybe it can effectively be
> > removed by adding it to user dictionary with a high penalty?). I'm not
> > sure why one would want to do this removal, just trying to understand
> > the design parameters.
> >
> > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > <tomoko.uchida.1...@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > > If I provide entries in the user dictionary is it just as if I had
> > > > included them in the system dictionary? If the same entry occurs in
> > > > both, do the user dictionary weights supersede those in the system
> > > > dictionary? Is there some way to suppress entries in the system dict?
> > >
> > > User dictionary is independent from the system dictionary. If you give
> > > the user entries, JapaneseTokenizer builds two FSTs one for the
> > > built-in dictionary and one for the user dictionary and they are
> > > retrieved separately.
> > >
> > > First the user dictionary is retrieved, and if there are no entries
> > > matched then the system dictionary is retrieved. So if any entry is
> > > found in the user dictionary, all possible candidates in the system
> > > dictionary are ignored (suppressed).
> > >
> > > (I think this is kuromoji specific behaviour, the original mecab pos
> > > tagger retrieves both of the system dictionary and user dictionary and
> > > compares their weights by performing Viterbi. In fact the behaviour -
> > > always gives priority to the entries in the user dictionary - is a bit
> > > too aggressive from the point of view of the consistency of
> > > tokenization. I do not know why, but there may be some performance
> > > reasons?)
> > >
> > > I think you can easily find the retrieval logic I described here in
> > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > not correct.)
> > >
> > > Regards,
> > > Tomoko
> > >
> > > On Sun, May 26, 2019 at 5:08, 김남규 <kng0...@gmail.com> wrote:
> > > >
> > > > Hi, Mike :D
> > > >
> > > > Japanese Analyzer does not load dictionaries by default.
> > > > If you look at the constructor, you can see that it is created as null if
> > > > not set parameters.
> > > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > > >
> > > > In JapaneseTokenizer,
> > > > =============================================
> > > > if (userDictionary != null) {
> > > >   userFST = userDictionary.getFST();
> > > >   userFSTReader = userFST.getBytesReader();
> > > > } else {
> > > >   userFST = null;
> > > >   userFSTReader = null;
> > > > }
> > > > =============================================
> > > > Since it is a way to create and pass the UserDictionary object, there is no
> > > > conflict between user dictionary and system dictionary.
> > > > (You may choose only one of them! -> means userFST instance in
> > > > JapaneseTokenizer)
> > > >
> > > > About dictionary,
> > > > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > > It called MeCab which uses the Viterbi algorithm.
> > > > In Lucene, Convert MeCab dictionary(in Lucene, some dat files) to FST and
> > > > use
> > > > But it can't satisfy all users.
> > > > Depending on the situation, some user may need a custom dictionary.
> > > > It is also same for Nori(Korean Analyzer) since Lucene 7.4. (The basic
> > > > logic(MeCab + FST) is similar to Japanese Analyzer)
> > > > The original Korean MeCab dictionary size is almost 220MB, but Lucene's
> > > > dictionary size is 24MB.
> > > > If the user needs a dictionary of 100MB size, the user must build and use
> > > > it.
> > > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > > >
> > > > If anyone find some wrong information in my reply, please send a reply with
> > > > the correction.
> > > >
> > > > Thank you,
> > > > Namgyu Kim
> > > >
> > > >
> > > > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> > > >
> > > > > I'm trying to understand the relationship between the system and user
> > > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > > provide a user dictionary; the system one is built in. Are they
> > > > > otherwise the same kind of thing? If I provide entries in the user
> > > > > dictionary is it just as if I had included them in the system
> > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > weights supersede those in the system dictionary? Is there some way to
> > > > > suppress entries in the system dict? I hunted for documentation, but
> > > > > didn't find answers to these questions, and the code is pretty
> > > > > involved, so any pointers would be greatly appreciated.
> > > > >
> > > > > -Mike
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org