[Nutch-dev] Re: Chinese in Nutch:My solution

Jack Tang Tue, 12 Apr 2005 00:16:16 -0700

Hi Cao

Great job!


On Apr 12, 2005 2:37 PM, cao yuzhong <[EMAIL PROTECTED]> wrote:
> Hi, everyone:
> 
> I have integrated Nutch with an intelligent Chinese
> lexical analysis system, so Nutch can now segment
> Chinese words effectively.
> 
> My solution is as follows:
> 
> 1. Modify NutchAnalysis.jj:
> 
> -|  <#CJK:                                        // non-alphabets
> -      [
> -       "\u3040"-"\u318f",
> -       "\u3300"-"\u337f",
> -       "\u3400"-"\u3d2d",
> -       "\u4e00"-"\u9fff",
> -       "\uf900"-"\ufaff"
> -      ]
> -    >
> 
> +|  <#OTHER_CJK:  //japanese and korean characters
> +      [
> +       "\u3040"-"\u318f",
> +       "\u3300"-"\u337f",
> +       "\u3400"-"\u3d2d",
> +       "\uf900"-"\ufaff"
> +      ]
> +    >
> +|  <#CHINESE:   //chinese characters
> +     [
> +       "\u4e00"-"\u9fff"
> +     ]
> +   >
> 
> -| <SIGRAM: <CJK> >
> 
> +| <SIGRAM: <OTHER_CJK> >
> +| <CNWORD: (<CHINESE>)+ > //chinese words
> 
> - ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
> + ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)
> 
> Chinese characters are segmented intelligently, but Japanese
> and Korean characters still use single-gram segmentation.

If there are any Japanese or Korean developers here, we would be very
glad to hear from them :)
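
To make the effect of the grammar change concrete, here is a minimal
sketch in plain Java, not the generated JavaCC tokenizer. The class and
method names (CjkRunSketch, tokenize) are made up for illustration; only
the character ranges are taken from the diff above. A run of characters in
the CHINESE range becomes one CNWORD token, while each character in the
OTHER_CJK ranges becomes its own SIGRAM token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Nutch.
class CjkRunSketch {
    // Mirrors the <CHINESE> range from the modified grammar.
    static boolean isChinese(char c) {
        return c >= '\u4e00' && c <= '\u9fff';
    }

    // Mirrors the <OTHER_CJK> ranges from the modified grammar.
    static boolean isOtherCjk(char c) {
        return (c >= '\u3040' && c <= '\u318f')
            || (c >= '\u3300' && c <= '\u337f')
            || (c >= '\u3400' && c <= '\u3d2d')
            || (c >= '\uf900' && c <= '\ufaff');
    }

    // Emits "CNWORD:<run>" for each maximal run of Chinese characters and
    // "SIGRAM:<char>" for each other CJK character. All other characters
    // are skipped, since WORD and ACRONYM are out of scope for this sketch.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (isChinese(c)) {
                int start = i;
                while (i < text.length() && isChinese(text.charAt(i))) {
                    i++;
                }
                tokens.add("CNWORD:" + text.substring(start, i));
            } else if (isOtherCjk(c)) {
                tokens.add("SIGRAM:" + c);
                i++;
            } else {
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by two hiragana characters.
        System.out.println(tokenize("\u4e2d\u6587\u3042\u3044"));
    }
}
```

Note that a CNWORD produced this way is still an unsegmented run of
Chinese characters; splitting it into real words is left to the segmenter
described in step 3.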


> 2. Modify NutchDocumentTokenizer.java
> 
> -case EOF: case WORD: case ACRONYM: case SIGRAM:
> +case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:
> 
> 3. Modify FastCharStream.java
> I use ICTCLASC to perform Chinese word segmentation. ICTCLASC does not
> simply perform bi-gram segmentation; it uses an approach based on a
> multi-layer HMM. Its segmentation precision is 97.58%.
> ICTCLASC is free for researchers; see:
> http://www.nlp.org.cn/project/project.php?proj_id=6

Cool. I should learn more about it.

> 4. Modify Summarizer.java

> If Chinese word segmentation could be done in NutchAnalysis.jj,
> before the tokenizer, then we would not need to reset the tokens'
> offsets in Summarizer.java and everything would be perfect.

True. You can see this in the NutchAnalysisTokenManager.jjFillToken() method.
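
As a rough illustration of what "resetting offsets" involves when
segmentation happens after tokenizing: each word cut out of a CNWORD token
needs its own start and end position in the original text. The sketch below
is an assumption about the general idea, not Cao's actual Summarizer.java
change. OffsetSketch, Span and split are hypothetical names, and it assumes
the pieces concatenate back to exactly the original token text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Nutch.
class OffsetSketch {
    // A piece of text with its [start, end) offsets in the source document.
    static final class Span {
        final String text;
        final int start;
        final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Given the start offset of an unsegmented token and the words a
    // segmenter produced from it, assign each word its own offsets.
    // Assumes the pieces, joined in order, equal the original token text.
    static List<Span> split(int tokenStart, List<String> pieces) {
        List<Span> spans = new ArrayList<>();
        int offset = tokenStart;
        for (String piece : pieces) {
            spans.add(new Span(piece, offset, offset + piece.length()));
            offset += piece.length();
        }
        return spans;
    }

    public static void main(String[] args) {
        // A four-character token starting at offset 10, cut into two words.
        List<Span> spans = split(10, Arrays.asList("\u4e2d\u6587", "\u5206\u8bcd"));
        for (Span s : spans) {
            System.out.println(s.start + "-" + s.end);
        }
    }
}
```

If the grammar itself could emit the segmented words, the tokenizer would
assign these offsets directly and no such fix-up would be needed.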


> But it seems too difficult to perform intelligent Chinese word
> segmentation in NutchAnalysis.jj. Perhaps even impossible?

In fact, the Chinese segmentation problem is equivalent to this question:
given an English sentence S = "Nutchisasearchengine", how can we
recover, as well as we can, the result R = "Nutch is a search engine"?
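
One simple way to attack the S -> R question is dictionary-based forward
maximum matching. This is a minimal sketch, not what ICTCLASC does (it uses
a multi-layer HMM, as noted above). The class name, the tiny dictionary and
the maxWordLen parameter are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Nutch or ICTCLASC.
class MaxMatchSketch {
    // At each position, take the longest dictionary word that starts there.
    // If nothing of length >= 2 matches, emit a single character and move on.
    static List<String> segment(String s, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int end = i + 1; // single-character fallback
            for (int len = Math.min(maxWordLen, s.length() - i); len > 1; len--) {
                String candidate = s.substring(i, i + len).toLowerCase(Locale.ROOT);
                if (dict.contains(candidate)) {
                    end = i + len;
                    break;
                }
            }
            words.add(s.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(
            Arrays.asList("nutch", "is", "a", "search", "engine"));
        // Prints: Nutch is a search engine
        System.out.println(String.join(" ", segment("Nutchisasearchengine", dict, 6)));
    }
}
```

The greedy choice is fragile: if the dictionary also contained "as", the
same input would come out as "Nutch is as e a r ..." because "as" is
preferred over "a" at that position. That kind of ambiguity is why
statistical approaches such as an HMM do better than plain dictionary
lookup.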

> Any suggestions?
> 
> Best regards
> 
> Cao Yuzhong
> 2005-04-12
> 
> 

Regards
/Jack


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
