Hi Jack,

You can get my Chinese phrase dictionary and my Chinese sentence segmentation program at the link below:

[sandbox] Two experimental modules for Lucene Chinese word segmentation
<http://www.grass.org.cn/blog/archives/2005/07/sandboxluceneae.html>
» gRaSSland development diary <http://www.grass.org.cn/blog/>
And now I have solved the problems I mentioned in my previous mail.

good luck
tian

2005/7/22, Jack Tang <[EMAIL PROTECTED]>:
>
> Hi Transbuerg
>
> First, could you please explain what you mean by "the same condition"? Thanks.
>
> I have no constructive suggestion for you, because your dictionary
> is really huge. If you can share it with me, I would appreciate that.
>
> And here are my tips on the dictionary structure. Optimization? I
> don't think so.
> 1. I split the whole dictionary into small pieces according to
> Chinese pronunciation.
> 2. Use dictionary lazy loading when indexing.
> 3. Load on demand when searching, so you can control how many
> pieces should be in memory.
>
> Thoughts?
>
> /Jack
>
> On 7/20/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> > Hi,
> > WebLucene does not use dictionary-based segmentation. On the
> > contrary, it uses bi-gram segmentation. You can get more information
> > at http://www.chedong.com, or search for "车东" (Che Dong) and lucene.
> >
> > At the moment I am trying to use dictionary-based segmentation; you
> > can visit my blog:
> > http://blog.csdn.net/accesine960/category/35308.aspx
> >
> > I have written a dictionary-based segmentation Java program, but it
> > is still under testing. I have run into two issues:
> > 1. My dictionary contains about 150,000 Chinese phrases, so I put it
> > into a HashMap when segmenting.
> > 2. With this program the index-building process works well, but when
> > searching, my server's CPU is always 99% busy. (My server: 4 GB of
> > memory and 4 CPUs; the index file is about 2.2 GB.)
> >
> > So in recent days I have been striving to solve the above two issues.
> >
> > good luck
> > If you are Chinese, we could use Chinese for further exchanges......
> >
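As a rough Java sketch of the dictionary structure Jack suggests above: the phrases are split into per-pronunciation files, each piece is loaded only when first needed, and pieces can be evicted to cap memory. All class, method, and file names here are hypothetical illustrations, not code from either poster:

    import java.io.*;
    import java.util.*;

    /** Hypothetical sketch: a phrase dictionary split into per-pinyin
     *  pieces, loaded lazily so the whole 150,000-phrase map never has
     *  to sit in memory at once. */
    public class LazyPhraseDictionary {
        private final File dictDir;   // one file per pinyin key, e.g. "zh.dic"
        private final Map<String, Set<String>> pieces =
                new HashMap<String, Set<String>>();

        public LazyPhraseDictionary(File dictDir) {
            this.dictDir = dictDir;
        }

        /** True if the phrase is in the dictionary, loading its piece on
         *  demand. Mapping a phrase to its pinyin key needs a separate
         *  lookup table, so the caller supplies the key here. */
        public boolean contains(String phrase, String pinyinKey)
                throws IOException {
            Set<String> piece = pieces.get(pinyinKey);
            if (piece == null) {           // lazy load on first touch
                piece = loadPiece(pinyinKey);
                pieces.put(pinyinKey, piece);
            }
            return piece.contains(phrase);
        }

        /** Drop a piece to cap memory, as Jack suggests for search time. */
        public void evict(String pinyinKey) {
            pieces.remove(pinyinKey);
        }

        private Set<String> loadPiece(String pinyinKey) throws IOException {
            Set<String> piece = new HashSet<String>();
            File f = new File(dictDir, pinyinKey + ".dic");
            if (!f.exists()) return piece;
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    piece.add(line.trim());   // one phrase per line
                }
            } finally {
                in.close();
            }
            return piece;
        }
    }

A HashSet per piece keeps lookups O(1); wiring evict() to an LRU policy would give the "control how many pieces should be in memory" behavior Jack mentions, and may also help with the 99%-CPU search problem if the segmenter probes the dictionary for every candidate substring.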
> > 2005/7/19, Jack Tang <[EMAIL PROTECTED]>:
> > > Hi Transbuerg
> > >
> > > Could you please describe your solution in detail? Appreciate your time.
> > >
> > > Regards
> > > /Jack
> > >
> > > On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> > > > Hi, Jack Tang
> > > >
> > > > I am in the same condition as you. Could you share your complete
> > > > NutchAnalysis.jj file here? I am not using Nutch, but Lucene.
> > > >
> > > > good luck.
> > > >
> > > > http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx
> > > >
> > > > 2005/7/15, Jack Tang <[EMAIL PROTECTED]>:
> > > > > Hi All
> > > > >
> > > > > It has taken me a long time to think about embedding an improved
> > > > > CJKAnalysis into NutchAnalysis. I got nothing but some failure
> > > > > experiences, which I share with you here; maybe you can hack it
> > > > > (well, I am not going to give up).
> > > > >
> > > > > I have written several Chinese word segmenters: some are
> > > > > dictionary-based, such as Forward Maximum Matching (FMM) and
> > > > > Backward Maximum Matching (BMM), and some are auto-segmenting,
> > > > > say bi-gram. They work fine in a pure Chinese text environment
> > > > > (not a mixture of Chinese and other languages).
> > > > >
> > > > > Why do I only aim at pure Chinese text? In NutchAnalysis.jj:
> > > > >
> > > > > <orig>
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: <CJK> >
> > > > >
> > > > > </orig>
> > > > >
> > > > > <modified>
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > >
> > > > > </modified>
> > > > >
> > > > > SIGRAM only contains CJK words.
> > > > >
> > > > > Well, I am not very familiar with JavaCC, so a big puzzle has
> > > > > stopped me. As you know:
> > > > >
> > > > > // basic word -- lowercase it
> > > > > <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
> > > > >   { matchedToken.image = matchedToken.image.toLowerCase(); }
> > > > >
> > > > > This statement means that if the sentence matches the "WORD" rule,
> > > > > then the wrapped object matchedToken will carry the extracted
> > > > > target word. *ONE* word is extracted per match.
> > > > >
> > > > > So the term() function is simple:
> > > > >
> > > > > /** Parse a single term. */
> > > > > String term() :
> > > > > {
> > > > >   Token token;
> > > > > }
> > > > > {
> > > > >   ( token=<WORD> | token=<ACRONYM> ) // I don't think it is
> > > > >                                      // reasonable to put
> > > > >                                      // "token=<SIGRAM>" here.
> > > > >   { return token.image; }
> > > > > }
> > > > >
> > > > > For CJK it is quite different. We have to extract *MANY* words in
> > > > > one match:
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > > {
> > > > >   // parsing <CJK>+ will generate many words (tokens) here!
> > > > > }
> > > > >
> > > > > My approach is to construct one TokenList to hold these tokens.
> > > > > The pseudocode looks like:
> > > > >
> > > > > // chinese, japanese and korean characters
> > > > > | <SIGRAM: (<CJK>)+ >
> > > > > {
> > > > >   for (int i = 0; i < image.length(); ...) {
> > > > >     Token token = /* extract in bi-gram */;
> > > > >     tokenList.add(token);
> > > > >   }
> > > > > }
> > > > >
> > > > > Accordingly, the term() function should return an ArrayList:
> > > > >
> > > > > /** .... **/
> > > > > ArrayList term() :
> > > > > {
> > > > >   Token token;
> > > > > }
> > > > > {
> > > > >   ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> )
> > > > >   { return tokenList; }
> > > > > }
> > > > >
> > > > > After these modifications, running NutchAnalysis.class you will
> > > > > get an odd result. Say I input some Chinese characters: C1C2C3
> > > > > The result will be: "C1C2 C2C3" (NOTICE the quotation marks).
> > > > >
> > > > > Am I going in the wrong direction? Or will someone share thoughts
> > > > > on NutchAnalysis.jj? Thanks
> > > > >
> > > > > Regards
> > > > > /Jack
> > > > >
> > > > > --
> > > > > Keep Discovering ... ...
> > > > > http://www.jroller.com/page/jmars
> > >
> > > --
> > > Keep Discovering ... ...
> > > http://www.jroller.com/page/jmars
>
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
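For reference, the bi-gram extraction Jack's pseudocode hints at can be written out as plain Java. This is only a guess at the intended behavior, with a stand-in token class so the snippet stays self-contained (the real grammar action would use JavaCC's Token type):

    import java.util.ArrayList;
    import java.util.List;

    /** Stand-in for the parser's token type, used only to keep this
     *  sketch self-contained. */
    class CjkToken {
        final String image;
        final int start, end;   // character offsets in the original run
        CjkToken(String image, int start, int end) {
            this.image = image; this.start = start; this.end = end;
        }
    }

    public class BigramSegmenter {
        /** Split a run of CJK characters into overlapping bi-grams:
         *  "C1C2C3" -> ["C1C2", "C2C3"]; a lone character is kept as-is. */
        public static List<CjkToken> bigrams(String image) {
            List<CjkToken> tokenList = new ArrayList<CjkToken>();
            if (image.length() == 1) {
                tokenList.add(new CjkToken(image, 0, 1));
                return tokenList;
            }
            for (int i = 0; i + 2 <= image.length(); i++) {
                tokenList.add(new CjkToken(image.substring(i, i + 2), i, i + 2));
            }
            return tokenList;
        }

        public static void main(String[] args) {
            for (CjkToken t : bigrams("中华人民")) {
                System.out.println(t.image + " [" + t.start + "," + t.end + ")");
            }
            // prints three lines: 中华 [0,2) / 华人 [1,3) / 人民 [2,4)
        }
    }

Note that "C1C2 C2C3" is exactly what bi-gram segmentation should produce for C1C2C3, so the output itself is not wrong; the open question in the thread is how to return multiple tokens from a JavaCC production that is declared to return a single term.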

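Jack also mentions dictionary-based Forward Maximum Matching (FMM). A minimal sketch of that technique, using a toy in-memory dictionary rather than the 150,000-phrase one discussed above:

    import java.util.*;

    /** Sketch of Forward Maximum Matching: at each position, take the
     *  longest dictionary phrase starting there; fall back to a single
     *  character when nothing matches. */
    public class FmmSegmenter {
        private final Set<String> dict;
        private final int maxLen;   // longest phrase in the dictionary

        public FmmSegmenter(Set<String> dict) {
            this.dict = dict;
            int m = 1;
            for (String w : dict) m = Math.max(m, w.length());
            this.maxLen = m;
        }

        public List<String> segment(String text) {
            List<String> words = new ArrayList<String>();
            int i = 0;
            while (i < text.length()) {
                int end = Math.min(i + maxLen, text.length());
                // shrink the window until it matches a dictionary phrase
                while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                    end--;
                }
                words.add(text.substring(i, end));  // single char if no match
                i = end;
            }
            return words;
        }

        public static void main(String[] args) {
            Set<String> dict = new HashSet<String>(
                    Arrays.asList("中华", "中华人民", "人民"));
            System.out.println(new FmmSegmenter(dict).segment("中华人民万岁"));
            // prints: [中华人民, 万, 岁]
        }
    }

BMM is the mirror image, scanning from the end of the string; comparing FMM and BMM output on the same sentence is a common way to flag ambiguous spans.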