Re: JapaneseAnalyzer's system vs user dict

Michael Sokolov Sun, 26 May 2019 05:19:51 -0700

Thanks, Namgyu. I've been able to build a dictionary using
DictionaryBuilder (I guess that is what the "regenerate" task must be
using?) and I can replace the existing one on the classpath with jar
surgery for now. Not a very user-friendly approach, but it will enable
me to run some experiments and see whether this is truly necessary for
my use case.
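
For reference, here is a minimal sketch of the lookup order Tomoko describes in the quoted reply below, next to the MeCab-style alternative she mentions. It is a toy model only: plain maps stand in for the FSTs, the romanized entries and costs are made up, and the class and method names (DictionaryLookupSketch, userFirst, merged) are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class DictionaryLookupSketch {

    /** Hypothetical candidate: surface form, cost, and the dictionary it came from. */
    static final class Candidate {
        final String surface;
        final int cost;
        final String source;

        Candidate(String surface, int cost, String source) {
            this.surface = surface;
            this.cost = cost;
            this.source = source;
        }

        @Override
        public String toString() {
            return source + ":" + surface;
        }
    }

    /** Every entry of dict that matches text starting at pos, sorted by surface for stable output. */
    static List<Candidate> matches(Map<String, Integer> dict, String source, String text, int pos) {
        List<Candidate> found = new ArrayList<>();
        for (Map.Entry<String, Integer> e : dict.entrySet()) {
            if (text.startsWith(e.getKey(), pos)) {
                found.add(new Candidate(e.getKey(), e.getValue(), source));
            }
        }
        found.sort(Comparator.comparing((Candidate c) -> c.surface));
        return found;
    }

    /** User-first lookup: any user match hides all system candidates at this position. */
    static List<Candidate> userFirst(Map<String, Integer> user, Map<String, Integer> system,
                                     String text, int pos) {
        List<Candidate> fromUser = matches(user, "user", text, pos);
        return fromUser.isEmpty() ? matches(system, "system", text, pos) : fromUser;
    }

    /** Merged lookup: both dictionaries contribute candidates; costs would be compared later. */
    static List<Candidate> merged(Map<String, Integer> user, Map<String, Integer> system,
                                  String text, int pos) {
        List<Candidate> all = new ArrayList<>(matches(user, "user", text, pos));
        all.addAll(matches(system, "system", text, pos));
        return all;
    }

    public static void main(String[] args) {
        // Made-up entries and costs, for illustration only.
        Map<String, Integer> system = Map.of("kansai", 100, "kansaikokusaikuukou", 50, "tokyo", 80);
        Map<String, Integer> user = Map.of("kansai", 500);

        System.out.println(userFirst(user, system, "kansaikokusaikuukou", 0));
        System.out.println(merged(user, system, "kansaikokusaikuukou", 0));
        System.out.println(userFirst(user, system, "tokyo", 0));
    }
}
```

With these sample maps, userFirst returns only the user entry for "kansai", even though the system dictionary has a longer, cheaper match, while merged keeps all three candidates so that a later cost comparison could choose among them. The real tokenizer works on a lattice with connection costs, so this only illustrates the precedence rule, not the actual Viterbi search.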


On Sun, May 26, 2019 at 7:56 AM Namgyu Kim <[email protected]> wrote:
>
> Sorry for the wrong information, Mike.
> Tomoko is right.
> I got it wrong when I checked.
>
> The user dictionary is independent of the system dictionary. If you give
> user entries, JapaneseTokenizer builds two FSTs, one for the
> built-in dictionary and one for the user dictionary, and they are
> looked up separately.
>
> Please ignore the following lines in my e-mail.
> ================================================
> Japanese Analyzer does not load dictionaries by default.
> ...
> Since it is a way to create and pass the UserDictionary object, there is no
> conflict between user dictionary and system dictionary.
> (You may choose only one of them! -> means userFST instance in
> JapaneseTokenizer)
> =================================================
>
> The system dictionary and the user dictionary are separate, and you can
> have both.
>
> About the system dictionary:
> As far as I know, it is not possible to change the system dictionary at
> the code level.
> The part that reads the system dictionary is hard-coded.
> (TokenInfoDictionary, UnknownDictionary, BinaryDictionary)
> If you really need it, could you open a JIRA issue so we can work on it
> together?
>
> But there is a way to build a new kuromoji jar.
> 1. Modify your dictionary file and rebuild.
>   1-1) Install MeCab
>   1-2) Install MeCab Dictionary
>   1-3) Modify your dictionary file
>   1-4) Package it as a tar.gz
> 2. Change kuromoji/ivy.xml from
> <artifact name="ipadic" type=".tar.gz" url="https://jaist.dl.sourceforge.net/project/mecab/mecab-ipadic/2.7.0-20070801/mecab-ipadic-2.7.0-20070801.tar.gz"/>
> to
> <artifact name="ipadic" type=".tar.gz" url="file:///your/tar/path/new_dic.tar.gz"/>
> 3. Run "ant regenerate" in /your/path/lucene-solr/lucene/analysis/kuromoji
> 4. Run "ant jar"
>
> I hope this helps.
>
> Warm regards,
> Namgyu Kim
>
> On Sun, May 26, 2019 at 9:03 AM, Michael Sokolov <[email protected]> wrote:
>
> > Thank you for the detailed responses! What Tomoko is saying seems
> > consistent with my cursory reading of the code. The reason I asked is
> > I have a customer that thinks they want to replace the system
> > dictionary, and I am trying to see if that is necessary. It seems as
> > if for the most part, we can supply a comprehensive user dictionary
> > and it would pretty much take the place of the system dictionary,
> > assuming it is a superset (covers at least the original system dict
> > tokens), but there is probably no way to "remove" a token that is
> > present in the system dictionary (or maybe it can effectively be
> > removed by adding it to user dictionary with a high penalty?). I'm not
> > sure why one would want to do this removal, just trying to understand
> > the design parameters.
> >
> > On Sat, May 25, 2019 at 7:30 PM Tomoko Uchida
> > <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > > > If I provide entries in the user
> > > > > dictionary is it just as if I had included them in the system
> > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > weights supersede those in the system dictionary? Is there some way to
> > > > > suppress entries in the system dict?
> > >
> > > > The user dictionary is independent of the system dictionary. If you give
> > > > user entries, JapaneseTokenizer builds two FSTs, one for the
> > > > built-in dictionary and one for the user dictionary, and they are
> > > > looked up separately.
> > > >
> > > > The user dictionary is looked up first, and only if no entries
> > > > match is the system dictionary looked up. So if any entry is
> > > > found in the user dictionary, all possible candidates in the system
> > > > dictionary are ignored (suppressed).
> > >
> > > > (I think this is kuromoji-specific behaviour. The original MeCab POS
> > > > tagger looks up both the system dictionary and the user dictionary and
> > > > compares their weights by running Viterbi. In fact this behaviour,
> > > > always giving priority to entries in the user dictionary, is a bit
> > > > too aggressive from the point of view of tokenization consistency.
> > > > I do not know why it was done this way, but there may be some
> > > > performance reasons?)
> > >
> > > > I think you can easily find the lookup logic I described here in
> > > > the JapaneseTokenizer#parse() method. (Let me know if my understanding
> > > > is not correct.)
> > >
> > > Regards,
> > > Tomoko
> > >
> > > 2019年5月26日(日) 5:08 김남규 <[email protected]>:
> > > >
> > > > Hi, Mike :D
> > > >
> > > > > Japanese Analyzer does not load dictionaries by default.
> > > > > If you look at the constructor, you can see that it is created as null
> > > > > if no parameters are set.
> > > > (check testUserDict3() in TestJapaneseAnalyzer.java)
> > > >
> > > > In JapaneseTokenizer,
> > > > =============================================
> > > > if (userDictionary != null) {
> > > >   userFST = userDictionary.getFST();
> > > >   userFSTReader = userFST.getBytesReader();
> > > > } else {
> > > >   userFST = null;
> > > >   userFSTReader = null;
> > > > }
> > > > =============================================
> > > > > Since it is a way to create and pass the UserDictionary object, there is no
> > > > > conflict between user dictionary and system dictionary.
> > > > (You may choose only one of them! -> means userFST instance in
> > > > JapaneseTokenizer)
> > > >
> > > > > About the dictionary:
> > > > > Lucene has shipped one pre-built dictionary by default since Lucene 3.6.
> > > > > You can find it in org.apache.lucene.analysis.ja.dict.
> > > > > It is based on MeCab, which uses the Viterbi algorithm.
> > > > > Lucene converts the MeCab dictionary (in Lucene, some dat files) to an
> > > > > FST and uses that.
> > > > > But it can't satisfy all users.
> > > > > Depending on the situation, some users may need a custom dictionary.
> > > > > The same is true for Nori (the Korean analyzer) since Lucene 7.4. (The basic
> > > > > logic (MeCab + FST) is similar to the Japanese analyzer.)
> > > > > The original Korean MeCab dictionary is almost 220MB, but Lucene's
> > > > > dictionary is 24MB.
> > > > > If a user needs a 100MB dictionary, the user must build and use it
> > > > > themselves.
> > > > > (Modify MeCab Dictionary -> Training -> Porting to Lucene)
> > > >
> > > > > If anyone finds wrong information in my reply, please send a reply
> > > > > with the correction.
> > > >
> > > > Thank you,
> > > > Namgyu Kim
> > > >
> > > >
> > > > 2019년 5월 26일 (일) 오전 4:03, Michael Sokolov <[email protected]>님이 작성:
> > > >
> > > > > I'm trying to understand the relationship between the system and user
> > > > > dictionaries that JapaneseAnalyzer uses. The API allows a user to
> > > > > provide a user dictionary; the system one is built in. Are they
> > > > > otherwise the same kind of thing? If I provide entries in the user
> > > > > dictionary is it just as if I had included them in the system
> > > > > dictionary? If the same entry occurs in both, do the user dictionary
> > > > > > weights supersede those in the system dictionary? Is there some way to
> > > > > > suppress entries in the system dict?  I hunted for documentation, but
> > > > > didn't find answers to these questions, and the code is pretty
> > > > > involved, so any pointers would be greatly appreciated.
> > > > >
> > > > > -Mike
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > > >