Hi Transbuerg,

Could you please describe your solution in detail? I appreciate your time.

Regards
/Jack

On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> Hi Jack Tang,
>
> I am in the same situation as you. Could you share your complete
> NutchAnalysis.jj file here? I am not using Nutch, only Lucene.
>
> Good luck.
>
> http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx
>
> 2005/7/15, Jack Tang <[EMAIL PROTECTED]>:
> > Hi All
> >
> > I have spent a long time thinking about embedding an improved
> > CJKAnalysis into NutchAnalysis. So far I have nothing but failures,
> > which I am sharing here in case you can hack it (well, I am not
> > going to give up).
> >
> > I have written several Chinese word segmenters. Some are dictionary
> > based, such as Forward Maximum Matching (FMM) and Backward Maximum
> > Matching (BMM), and some are automatic, say bi-gram. They work
> > fine in a pure Chinese environment (not a mixture of Chinese and
> > other languages).
> >
> > Why do I only aim at a pure Chinese environment? In NutchAnalysis.jj:
> >
> > <orig>
> >
> > // chinese, japanese and korean characters
> > | <SIGRAM: <CJK> >
> >
> > </orig>
> >
> > <modified>
> >
> > // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> >
> > </modified>
> >
> > SIGRAM only contains CJK words.
> >
> > Well, I am not very familiar with JavaCC, so one big puzzle has me
> > stuck. As you know:
> >
> > // basic word -- lowercase it
> > <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
> > { matchedToken.image = matchedToken.image.toLowerCase(); }
> >
> > This statement means that if the input matches the WORD rule, the
> > wrapped matchedToken object holds the target word. *ONE* word is
> > extracted per match.
> >
> > So, in the term() function, it is simple.
> >
> > /** Parse a single term. */
> > String term() :
> > {
> >   Token token;
> > }
> > {
> >   ( token=<WORD> | token=<ACRONYM>) // I don't think it is reasonable
> >                                     // to put "token=<SIGRAM>" here.
> >
> >   { return token.image; }
> > }
> >
> > For CJK it is quite different. We have to extract *MANY* words in
> > one match.
> >
> > // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> > {
> >   // parse <CJK>+ will generate many words(tokens) here!
> > }
> >
> > My approach is to construct one TokenList to hold these tokens.
> > The pseudocode looks like:
> >
> > // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> > {
> >   for (int i = 0; i < image.length();...) {
> >     Token token = extract in bi-gram.
> >     tokenList.add(token);
> >   }
> > }
> >
> > Accordingly, the term() function should return an ArrayList.
> >
> > /** .... **/
> > ArrayList term():
> > {
> >   Token token;
> > }
> > {
> >   (token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
> >   {
> >     return tokenList;
> >   }
> >
> > }
> >
> > After these modifications, running NutchAnalysis.class gives an odd
> > result. Say I input some Chinese characters: C1C2C3
> > The result will be: "C1C2 C2C3" (NOTICE the quotation marks).
> >
> > Am I heading in the wrong direction? Or will someone share any
> > thoughts on NutchAnalysis.jj? Thanks
> >
> > Regards
> > /Jack
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
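
For reference, the bi-gram step described in the quoted message can be sketched in plain Java, outside of JavaCC. This is only an illustration under stated assumptions: the class name BigramSketch and the method bigrams() are hypothetical and are not part of Nutch or Lucene, and keeping a lone character as a single term is an assumption of the sketch, not something the thread specifies. For an input run C1C2C3 it yields the two terms C1C2 and C2C3.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Lucene.
public class BigramSketch {

    /** Splits one run of CJK characters into overlapping two-character terms. */
    public static List<String> bigrams(String run) {
        // Work on code points so characters outside the BMP stay intact.
        int[] cps = run.codePoints().toArray();
        List<String> terms = new ArrayList<>();

        // Assumption: a run of exactly one character is kept as a single term.
        if (cps.length == 1) {
            terms.add(run);
            return terms;
        }

        // Slide a window of two characters over the run, one step at a time.
        for (int i = 0; i + 1 < cps.length; i++) {
            terms.add(new String(cps, i, 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Each element is a separate term, not one quoted phrase.
        System.out.println(bigrams("中文分词"));
    }
}
```

Returning a list of separate terms, rather than one string that joins them with spaces, is the design point here: each bi-gram is meant to reach the query as its own term.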
