[Nutch-dev] Re: NutchAnalysis and CJK

Jack Tang Tue, 19 Jul 2005 02:04:53 -0700

Hi Transbuerg,

Could you please describe your solution in detail? I appreciate your time.


Regards
/Jack

On 7/15/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> Hi Jack Tang,
> 
> I am in the same situation as you. Could you share your complete
> NutchAnalysis.jj file here? I am not using Nutch, only Lucene.
> 
> Good luck.
> 
>         
> http://blog.csdn.net/accesine960/archive/2005/07/13/424306.aspx
> 
> 
> 2005/7/15, Jack Tang < [EMAIL PROTECTED]>:
> > Hi All
> > 
> > I have spent a long time trying to embed an improved CJKAnalysis
> > into NutchAnalysis. So far I have nothing but failed attempts, which
> > I am sharing here in case someone can hack it further (I am not
> > going to give up).
> > 
> > I have written several Chinese word segmenters. Some are dictionary
> > based, such as Forward Maximum Matching (FMM) and Backward Maximum
> > Matching (BMM), and some segment automatically, such as bi-gram. They
> > work fine in a pure Chinese environment (not a mixture of Chinese
> > and other languages).
> > 
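For readers unfamiliar with Forward Maximum Matching, here is a minimal Java sketch of the idea. It is not the code from this thread: the class name `FmmSketch`, the `segment` signature, and the `maxLen` parameter are all assumptions made for illustration, and it treats every `char` as one character (so it ignores supplementary code points).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Forward Maximum Matching sketch; names and parameters are illustrative. */
final class FmmSketch {
    /**
     * Scans left to right, always taking the longest dictionary word
     * (up to maxLen chars) that starts at the current position.
     * A character that starts no dictionary word is emitted on its own.
     */
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate from the right until it is a dictionary
            // word or only a single character is left.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With a dictionary containing 中国, 中国人 and 人民, the input 中国人民 comes out as 中国人 followed by 民, which shows the greedy behaviour: the longest match at the front wins even when a different split would cover the whole string with dictionary words. Backward Maximum Matching is the same scan run from the right-hand end.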
> > Why do I only target a pure Chinese environment? Consider this rule in
> > NutchAnalysis.jj:
> > 
> > <orig>
> > 
> >   // chinese, japanese and korean characters
> > | <SIGRAM: <CJK> >
> > 
> > </orig>
> > 
> > <modified> 
> > 
> >   // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> > 
> > </modified>
> > 
> > After this change, a SIGRAM token contains only CJK characters.
> > 
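To make the effect of `(<CJK>)+` concrete outside JavaCC, the sketch below collects maximal runs of CJK characters from mixed text in plain Java. It is only an approximation: the `<CJK>` character set in NutchAnalysis.jj is defined by explicit Unicode ranges, whereas this sketch assumes the Han, Hiragana, Katakana and Hangul scripts via `Character.UnicodeScript`. The class and method names are hypothetical.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Collects maximal CJK runs, roughly what a (<CJK>)+ token would match. */
final class CjkRunSketch {
    // Assumption: "CJK" means these four scripts; the grammar's own
    // <CJK> definition may cover a different set of ranges.
    static boolean isCjk(int codePoint) {
        UnicodeScript s = UnicodeScript.of(codePoint);
        return s == UnicodeScript.HAN || s == UnicodeScript.HIRAGANA
            || s == UnicodeScript.KATAKANA || s == UnicodeScript.HANGUL;
    }

    /** Returns each maximal run of CJK code points; other text is skipped. */
    static List<String> cjkRuns(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isCjk(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                // A non-CJK character ends the current run.
                runs.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }
}
```

Each returned run is what one SIGRAM match would hold under the modified rule. The original rule `<CJK>` would instead yield one token per character.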
> > I am not very familiar with JavaCC, so one big puzzle has me stuck.
> > As you know:
> > 
> >   // basic word -- lowercase it
> > <WORD: ((<LETTER>|<DIGIT>|<WORD_PUNCT>)+ | <IRREGULAR_WORD>)>
> >   { matchedToken.image = matchedToken.image.toLowerCase(); }
> > 
> > This statement means that when the input matches the WORD rule, the
> > matchedToken object holds the matched word. *ONE* word is extracted
> > per match.
> > 
> > So the term() function is simple:
> > 
> > /** Parse a single term. */
> > String term() :
> > {
> >   Token token;
> > }
> > {
> >   // I don't think it is reasonable to put "token=<SIGRAM>" here.
> >   ( token=<WORD> | token=<ACRONYM> )
> > 
> >   { return token.image; }
> > }
> > 
> > For CJK it is quite different. We have to extract *MANY* words in one
> > match.
> > 
> >   // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> > {
> > // parsing (<CJK>)+ generates many words (tokens) here!
> > }
> > 
> > My approach is to construct a TokenList to hold these tokens.
> > The pseudocode looks like this:
> > 
> >   // chinese, japanese and korean characters
> > | <SIGRAM: (<CJK>)+ >
> > {
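> >   // image holds the matched CJK run; emit overlapping bi-grams
> >   for (int i = 0; i + 1 < image.length(); i++) {
> >     Token token = new Token();
> >     token.image = image.substring(i, i + 2);  // two-character window
> >     tokenList.add(token);  // tokenList: assumed token-manager field
> >   }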
> > }
> > 
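The bi-gram step in the pseudocode above can be tried on its own, without JavaCC. The following is a minimal Java sketch under stated assumptions: `BigramSketch` and `bigrams` are hypothetical names, the input is one already-matched CJK run, each `char` is treated as one character, and a run of length one is returned unchanged (the thread does not say how a single character should be handled).

```java
import java.util.ArrayList;
import java.util.List;

/** Overlapping bi-gram extraction for one run of CJK characters. */
final class BigramSketch {
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() == 1) {
            // Assumption: a lone character is kept as a single token.
            out.add(run);
            return out;
        }
        // Slide a two-character window over the run: C1C2, C2C3, ...
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }
}
```

For a three-character run C1C2C3 this yields the two tokens C1C2 and C2C3, which matches the terms reported later in the message.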
> > Accordingly, the term() function should return an ArrayList:
> > 
> > /** .... **/
> > ArrayList term():
> > {
> > Token token;
> > }
> > {
> > (token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
> >   {
> >     return tokenList;
> >   }
> > 
> > }
> > 
> > After these modifications, running NutchAnalysis.class gives an odd
> > result. Say I input some Chinese characters: C1C2C3.
> > The result will be: "C1C2 C2C3" (NOTICE the quotation marks).
> > 
> > Am I heading in the wrong direction? Can anyone share thoughts on
> > NutchAnalysis.jj? Thanks.
> > 
> > 
> > 
> > Regards
> > /Jack
> > 
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> > 
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars


