Re: JapaneseAnalyzer's system vs user dict

Namgyu Kim Sun, 26 May 2019 06:49:03 -0700

I've been able to build a dictionary using DictionaryBuilder (I guess that
is what the "regenerate" task must be using?)
=>
Yes, that's right.
The "regenerate" target runs the following steps in order:
1) Compile the code (compile-tools)
2) Download the dictionary archive (download-dict)
3) Apply the patch to Noun.proper.csv (patch-dict)
4) Run DictionaryBuilder (build-dict)


Not a very user-friendly approach
=>
I agree.
It is not user-friendly and puts an unnecessary burden on the user.
I think it would be better to let the dictionary be passed in through the
constructor of JapaneseTokenizer.

What do you think about this?

Warm regards,
Namgyu Kim


On Sun, May 26, 2019 at 9:19 PM, Michael Sokolov <msoko...@gmail.com> wrote:

> Thanks, Namgyu. I've been able to build a dictionary using
> DictionaryBuilder (I guess that is what the "regenerate" task must be
> using?) and I can replace the existing one on the classpath with jar
> surgery for now. Not a very user-friendly approach, but it will enable
> me to run some experiments and see whether this is truly necessary for
> my use case.
>
> On Sun, May 26, 2019 at 7:56 AM Namgyu Kim <kng0...@gmail.com> wrote:
> >
> > Sorry for the wrong information, Mike.
> > Tomoko is right.
> > I checked it incorrectly.
> >
> > The user dictionary is independent of the system dictionary. If you
> > supply user entries, JapaneseTokenizer builds two FSTs, one for the
> > built-in dictionary and one for the user dictionary, and they are
> > looked up separately.
> >
> > Please ignore the following lines in my e-mail.
> > ================================================
> > Japanese Analyzer does not load dictionaries by default.
> > ...
> > Since it is a way to create and pass the UserDictionary object, there is no
> > conflict between user dictionary and system dictionary.
> > (You may choose only one of them! -> means userFST instance in
> > JapaneseTokenizer)
> > =================================================
> >
> > The system dictionary and the user dictionary are separate, and you can
> > have both.
> >
> > About the system dictionary:
> > As far as I know, it is not possible to change the system dictionary at the
> > code level.
> > The part that reads the system dictionary is hard-coded
> > (TokenInfoDictionary, UnknownDictionary, BinaryDictionary).
> > If you really need it, could you open a JIRA issue so we can work on it
> > together?
> >
> > But there is a way to build a new kuromoji jar.
> > 1. Modify your dictionary file and rebuild.
> >   1-1) Install MeCab
> >   1-2) Install the MeCab dictionary
> >   1-3) Modify your dictionary file
> >   1-4) Package it as a tar.gz
> > 2. Change kuromoji/ivy.xml from
> > <artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
> > to
> > <artifact name="ipadic" type=".tar.gz" url="file:///your/tar path/new_dic.tar.gz"/>
> > 3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> > 4. Run "ant jar"
> >
> > I hope this helps.
> >
> > Warm regards,
> > Namgyu Kim
> >
> > On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> >
> > > Thank you for the detailed responses! What Tomoko is saying seems
> > > consistent with my cursory reading of the code. The reason I asked is
> > > I have a customer that thinks they want to replace the system
> > > dictionary, and I am trying to see if that is necessary. It seems as
> > > if for the most part, we can supply a comprehensive user dictionary
> > > and it would pretty much take the place of the system dictionary,
> > > assuming it is a superset (covers at least the original system dict
> > > tokens), but there is probably no way to "remove" a token that is
> > > present in the system dictionary (or maybe it can effectively be
> > > removed by adding it to user dictionary with a high penalty?). I'm not
> > > sure why one would want to do this removal, just trying to understand
> > > the design parameters.
> > >
> > > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > > <tomoko.uchida.1...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > > > If I provide entries in the user dictionary is it just as if I
> > > > > > had included them in the system dictionary? If the same entry
> > > > > > occurs in both, do the user dictionary weights supersede those in
> > > > > > the system dictionary? Is there some way to suppress entries in
> > > > > > the system dict?
> > > >
> > > > > The user dictionary is independent of the system dictionary. If you
> > > > > supply user entries, JapaneseTokenizer builds two FSTs, one for the
> > > > > built-in dictionary and one for the user dictionary, and they are
> > > > > looked up separately.
> > > >
> > > > First the user dictionary is retrieved, and if there are no entries
> > > > matched then the system dictionary is retrieved. So if any entry is
> > > > found in the user dictionary, all possible candidates in the system
> > > > dictionary are ignored (suppressed).
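
The lookup order described above can be modelled in a few lines of plain Java. This is only a toy sketch and not Lucene code: `PriorityLookup`, `prefixMatches`, and `candidatesAt` are hypothetical names, plain string sets stand in for the two FSTs, and ASCII strings stand in for Japanese surface forms. The real tokenizer also tracks word costs and part-of-speech data, which are omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of "user dictionary first, system dictionary only as a fallback". */
public class PriorityLookup {

    /** Returns every entry of dict that matches text starting at position pos, shortest first. */
    static List<String> prefixMatches(Set<String> dict, String text, int pos) {
        List<String> matches = new ArrayList<>();
        for (int end = pos + 1; end <= text.length(); end++) {
            String candidate = text.substring(pos, end);
            if (dict.contains(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** Candidates at pos: user dictionary matches if there are any, otherwise system matches. */
    static List<String> candidatesAt(Set<String> userDict, Set<String> systemDict,
                                     String text, int pos) {
        List<String> userMatches = prefixMatches(userDict, text, pos);
        if (!userMatches.isEmpty()) {
            return userMatches; // any user match suppresses all system candidates here
        }
        return prefixMatches(systemDict, text, pos);
    }

    public static void main(String[] args) {
        Set<String> system = new HashSet<>(Arrays.asList("to", "tokyo", "tower"));
        Set<String> user = new HashSet<>(Arrays.asList("tokyotower"));
        // The user entry matches, so "to" and "tokyo" from the system set are never considered.
        System.out.println(candidatesAt(user, system, "tokyotower", 0)); // [tokyotower]
        // No user entry matches here, so the system set is consulted.
        System.out.println(candidatesAt(user, system, "tower", 0));      // [to, tower]
    }
}
```

In this model, a user entry hides every system candidate that starts at the same position, whatever its cost, which is the "too aggressive" effect discussed next.
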
> > > >
> > > > > (I think this is kuromoji-specific behaviour; the original MeCab POS
> > > > > tagger looks up both the system dictionary and the user dictionary and
> > > > > compares their weights by performing Viterbi. In fact the behaviour -
> > > > > always giving priority to the entries in the user dictionary - is a bit
> > > > > too aggressive from the point of view of the consistency of
> > > > > tokenization. I do not know why, but there may be some performance
> > > > > reasons?)
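
For contrast, the MeCab-style behaviour mentioned above (both dictionaries contribute candidates and the cheapest path wins) can be sketched as follows. This is a toy under stated assumptions: `MergedViterbi`, `merge`, and `segment` are hypothetical names, the costs are made up, and only per-word costs are modelled. The connection costs between adjacent tokens that a real lattice search also uses are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy lattice search: both dictionaries contribute candidates, lowest total cost wins. */
public class MergedViterbi {

    /** Merges both dictionaries; for a duplicate surface form the cheaper cost is kept. */
    static Map<String, Integer> merge(Map<String, Integer> system, Map<String, Integer> user) {
        Map<String, Integer> merged = new HashMap<>(system);
        user.forEach((word, cost) -> merged.merge(word, cost, Integer::min));
        return merged;
    }

    /** Splits text into the cheapest sequence of dictionary words, or returns null if none exists. */
    static List<String> segment(Map<String, Integer> dict, String text) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i] = cheapest cost of reaching position i
        int[] back = new int[n + 1];   // back[i] = start of the last word on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;
        for (int start = 0; start < n; start++) {
            if (best[start] == Long.MAX_VALUE) {
                continue; // position not reachable with dictionary words
            }
            for (int end = start + 1; end <= n; end++) {
                Integer cost = dict.get(text.substring(start, end));
                if (cost != null && best[start] + cost < best[end]) {
                    best[end] = best[start] + cost;
                    back[end] = start;
                }
            }
        }
        if (best[n] == Long.MAX_VALUE) {
            return null;
        }
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            words.add(text.substring(back[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        Map<String, Integer> system = new HashMap<>();
        system.put("tokyo", 100);
        system.put("tower", 100);
        Map<String, Integer> user = new HashMap<>();
        user.put("tokyotower", 500); // expensive user entry loses to tokyo + tower (200)
        System.out.println(segment(merge(system, user), "tokyotower")); // [tokyo, tower]
        user.put("tokyotower", 150); // cheaper user entry now wins
        System.out.println(segment(merge(system, user), "tokyotower")); // [tokyotower]
    }
}
```

Here a user entry only wins when its cost makes the overall path cheaper, whereas the kuromoji behaviour described above prefers the user entry regardless of cost.
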
> > > >
> > > > I think you can easily find the retrieval logic I described here in
> > > > JapaneseTokenizer#parse() method. (Let me know if my understanding is
> > > > not correct.)
> > > >
> > > > Regards,
> > > > Tomoko
> > > >
> > > > > On Sun, May 26, 2019 at 5:08, 김남규 <kng0...@gmail.com> wrote:
> > > > >
> > > > > Hi, Mike :D
> > > > >
> > > > > > Japanese Analyzer does not load dictionaries by default.
> > > > > > If you look at the constructor, you can see that the dictionary is
> > > > > > null if the parameter is not set.
> > > > > > (See testUserDict3() in TestJapaneseAnalyzer.java.)
> > > > >
> > > > > In JapaneseTokenizer,
> > > > > =============================================
> > > > > if (userDictionary != null) {
> > > > >   userFST = userDictionary.getFST();
> > > > >   userFSTReader = userFST.getBytesReader();
> > > > > } else {
> > > > >   userFST = null;
> > > > >   userFSTReader = null;
> > > > > }
> > > > > =============================================
> > > > > > Since it is a way to create and pass the UserDictionary object, there
> > > > > > is no conflict between user dictionary and system dictionary.
> > > > > (You may choose only one of them! -> means userFST instance in
> > > > > JapaneseTokenizer)
> > > > >
> > > > > > About the dictionary:
> > > > > > Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> > > > > > You can check it in org.apache.lucene.analysis.ja.dict.
> > > > > > It comes from MeCab, which uses the Viterbi algorithm.
> > > > > > Lucene converts the MeCab dictionary (in Lucene, some dat files) to an
> > > > > > FST and uses that.
> > > > > > But it can't satisfy all users.
> > > > > > Depending on the situation, some users may need a custom dictionary.
> > > > > > The same is true of Nori (Korean Analyzer), available since Lucene 7.4.
> > > > > > (The basic logic (MeCab + FST) is similar to Japanese Analyzer.)
> > > > > > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > > > > > dictionary is 24MB.
> > > > > > If the user needs a 100MB dictionary, the user must build it and
> > > > > > use it.
> > > > > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > > > >
> > > > > > If anyone finds wrong information in my reply, please send a reply
> > > > > > with the correction.
> > > > >
> > > > > Thank you,
> > > > > Namgyu Kim
> > > > >
> > > > >
> > > > > > On Sun, May 26, 2019 at 4:03 AM, Michael Sokolov <msoko...@gmail.com> wrote:
> > > > >
> > > > > > > I'm trying to understand the relationship between the system and
> > > > > > > user dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > > > > provide a user dictionary; the system one is built in. Are they
> > > > > > > otherwise the same kind of thing? If I provide entries in the user
> > > > > > > dictionary is it just as if I had included them in the system
> > > > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > > > weights supersede those in the system dictionary? Is there some way
> > > > > > > to suppress entries in the system dict?  I hunted for documentation,
> > > > > > > but didn't find answers to these questions, and the code is pretty
> > > > > > > involved, so any pointers would be greatly appreciated.
> > > > > >
> > > > > > -Mike
> > > > > >
> > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > > > >
> > > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > >
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
