I've been able to build a dictionary using DictionaryBuilder (I guess that is what the "regenerate" task must be using?)
=> Yes, that's right. The "regenerate" task runs the following steps in order:
1) Compile the code (compile-tools)
2) Download the dictionary archive (download-dict)
3) Apply the Noun.proper.csv patch (patch-dict)
4) Run DictionaryBuilder (build-dict)

Not a very user-friendly approach
=> I agree. That approach is not user-friendly. I think it would be better to take the parameters in the JapaneseTokenizer constructor. What do you think?

Warm regards,
Namgyu Kim

On Sun, May 26, 2019 at 9:19 PM, Michael Sokolov <msoko...@gmail.com> wrote:

> Thanks, Namgyu. I've been able to build a dictionary using
> DictionaryBuilder (I guess that is what the "regenerate" task must be
> using?) and I can replace the existing one on the classpath with jar
> surgery for now. Not a very user-friendly approach, but it will enable
> me to run some experiments and see whether this is truly necessary for
> my use case.
>
> On Sun, May 26, 2019 at 7:56 AM Namgyu Kim <kng0...@gmail.com> wrote:
> >
> > Sorry for the wrong information, Mike.
> > Tomoko is right.
> > I checked it incorrectly.
> >
> > The user dictionary is independent of the system dictionary. If you
> > supply user entries, JapaneseTokenizer builds two FSTs, one for the
> > built-in dictionary and one for the user dictionary, and they are
> > looked up separately.
> >
> > Please ignore the following lines in my e-mail.
> > ================================================
> > Japanese Analyzer does not load dictionaries by default.
> > ...
> > Since the UserDictionary object is created and passed in, there is no
> > conflict between the user dictionary and the system dictionary.
> > (You may choose only one of them! -> this refers to the userFST
> > instance in JapaneseTokenizer)
> > =================================================
> >
> > The system dictionary and the user dictionary are separate, and you can
> > have both.
> >
> > About the system dictionary:
> > As far as I know, it is not possible to change the system dictionary at
> > the code level.
> > The part that reads the system dictionary is hard-coded.
> > (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> > If you really need it, could you open a JIRA issue and work on it with me?
> >
> > But there is a way to build a new kuromoji jar.
> > 1. Modify your dictionary file and rebuild.
> >   1-1) Install MeCab
> >   1-2) Install the MeCab dictionary
> >   1-3) Modify your dictionary file
> >   1-4) Package it as a tar.gz
> > 2. Change kuromoji/ivy.xml from
> > <artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
> > to
> > <artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
> > 3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> > 4. Run "ant jar"
> >
> > I hope this helps.
> >
> > Warm regards,
> > Namgyu Kim
> >
> > On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > > Thank you for the detailed responses! What Tomoko is saying seems
> > > consistent with my cursory reading of the code. The reason I asked is
> > > I have a customer that thinks they want to replace the system
> > > dictionary, and I am trying to see if that is necessary. It seems as
> > > if for the most part, we can supply a comprehensive user dictionary
> > > and it would pretty much take the place of the system dictionary,
> > > assuming it is a superset (covers at least the original system dict
> > > tokens), but there is probably no way to "remove" a token that is
> > > present in the system dictionary (or maybe it can effectively be
> > > removed by adding it to user dictionary with a high penalty?). I'm not
> > > sure why one would want to do this removal, just trying to understand
> > > the design parameters.
> > >
> > > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > > <tomoko.uchida.1...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > If I provide entries in the user
> > > > > dictionary is it just as if I had included them in the system
> > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > weights supersede those in the system dictionary?
> > > > > Is there some way to suppress entries in the system dict?
> > > >
> > > > The user dictionary is independent of the system dictionary. If you
> > > > supply user entries, JapaneseTokenizer builds two FSTs, one for the
> > > > built-in dictionary and one for the user dictionary, and they are
> > > > looked up separately.
> > > >
> > > > The user dictionary is looked up first, and the system dictionary is
> > > > consulted only if no entry matches there. So if any entry is found in
> > > > the user dictionary, all possible candidates in the system dictionary
> > > > are ignored (suppressed).
> > > >
> > > > (I think this is kuromoji-specific behaviour. The original MeCab POS
> > > > tagger looks up both the system dictionary and the user dictionary
> > > > and compares their weights by running Viterbi. In fact, this
> > > > behaviour (always giving priority to entries in the user dictionary)
> > > > is a bit too aggressive from the point of view of tokenization
> > > > consistency. I do not know why it was done this way, but there may be
> > > > performance reasons?)
> > > >
> > > > I think you can easily find the lookup logic I described here in the
> > > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > > not correct.)
> > > >
> > > > Regards,
> > > > Tomoko
> > > >
> > > > On Sun, May 26, 2019 at 5:08 AM, 김남규 <kng0...@gmail.com> wrote:
> > > > >
> > > > > Hi, Mike :D
> > > > >
> > > > > Japanese Analyzer does not load dictionaries by default.
> > > > > If you look at the constructor, you can see that it is created as
> > > > > null if the parameter is not set.
> > > > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > > > >
> > > > > In JapaneseTokenizer,
> > > > > =============================================
> > > > > if (userDictionary != null) {
> > > > >   userFST = userDictionary.getFST();
> > > > >   userFSTReader = userFST.getBytesReader();
> > > > > } else {
> > > > >   userFST = null;
> > > > >   userFSTReader = null;
> > > > > }
> > > > > =============================================
> > > > > Since the UserDictionary object is created and passed in, there is
> > > > > no conflict between the user dictionary and the system dictionary.
> > > > > (You may choose only one of them! -> this refers to the userFST
> > > > > instance in JapaneseTokenizer)
> > > > >
> > > > > About the dictionary:
> > > > > Lucene has one pre-built dictionary by default since Lucene 3.6.
> > > > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > > > It is based on MeCab, which uses the Viterbi algorithm.
> > > > > Lucene converts the MeCab dictionary (in Lucene, some dat files) to
> > > > > an FST and uses that.
> > > > > But it can't satisfy all users.
> > > > > Depending on the situation, some users may need a custom dictionary.
> > > > > The same is true for Nori (the Korean analyzer) since Lucene 7.4.
> > > > > (The basic logic (MeCab + FST) is similar to the Japanese analyzer.)
> > > > > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > > > > dictionary is 24MB.
> > > > > If the user needs a 100MB dictionary, the user must build it and
> > > > > use it.
> > > > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > > > >
> > > > > If anyone finds incorrect information in my reply, please reply
> > > > > with a correction.
> > > > >
> > > > > Thank you,
> > > > > Namgyu Kim
> > > > >
> > > > >
> > > > > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> > > > >
> > > > > > I'm trying to understand the relationship between the system and user
> > > > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > > > provide a user dictionary; the system one is built in. Are they
> > > > > > otherwise the same kind of thing? If I provide entries in the user
> > > > > > dictionary is it just as if I had included them in the system
> > > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > > weights supersede those in the system dictionary? Is there some way to
> > > > > > suppress entries in the system dict? I hunted for documentation, but
> > > > > > didn't find answers to these questions, and the code is pretty
> > > > > > involved, so any pointers would be greatly appreciated.
> > > > > >
> > > > > > -Mike
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
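
For anyone reading this thread later, the lookup order Tomoko describes (user dictionary first, system dictionary only when nothing matches there) can be illustrated with a small toy model. This is only a sketch and not Lucene code: plain maps stand in for the two FSTs, and the class name DictionaryLookupSketch, the method candidatesAt, and the sample entries are all hypothetical. The real logic lives in JapaneseTokenizer#parse() and may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the lookup order described in this thread. NOT Lucene code:
// plain maps stand in for the two FSTs, and every name here is hypothetical.
final class DictionaryLookupSketch {

    private final Map<String, List<String>> userDict;   // stands in for the user-dictionary FST
    private final Map<String, List<String>> systemDict; // stands in for the built-in dictionary FST

    DictionaryLookupSketch(Map<String, List<String>> userDict,
                           Map<String, List<String>> systemDict) {
        this.userDict = userDict;
        this.systemDict = systemDict;
    }

    // Returns the candidate entries for words starting at position pos.
    // If the user dictionary yields any match, system candidates are ignored.
    List<String> candidatesAt(String text, int pos) {
        List<String> fromUser = matchPrefixes(userDict, text, pos);
        if (!fromUser.isEmpty()) {
            return fromUser;
        }
        return matchPrefixes(systemDict, text, pos);
    }

    // Collects the entries of every dictionary key that matches text starting at pos.
    private static List<String> matchPrefixes(Map<String, List<String>> dict,
                                              String text, int pos) {
        List<String> out = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            List<String> entries = dict.get(text.substring(pos, end));
            if (entries != null) {
                out.addAll(entries);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> user =
            Map.of("tokyotower", List.of("user:tokyotower"));
        Map<String, List<String>> system = Map.of(
            "tokyo", List.of("system:tokyo"),
            "tower", List.of("system:tower"));

        DictionaryLookupSketch sketch = new DictionaryLookupSketch(user, system);
        // Position 0: the user entry matches, so "system:tokyo" is suppressed.
        System.out.println(sketch.candidatesAt("tokyotower", 0));
        // Position 5: no user entry starts here, so the system dictionary is used.
        System.out.println(sketch.candidatesAt("tokyotower", 5));
    }
}
```

In this model a user entry wins purely because it matched, with no comparison of weights, which is the point Tomoko contrasts with the original MeCab behaviour.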