Hi Cao,

Great job!

On Apr 12, 2005 2:37 PM, cao yuzhong <[EMAIL PROTECTED]> wrote:
> hi,every one:
>
> I have integrated Nutch with an intelligent Chinese
> Lexical Analysis System. So Nutch now can segment
> Chinese words effectively.
>
> Following is my solution:
>
> 1.modify NutchAnalysis.jj:
>
> -| <#CJK: // non-alphabets
> -    [
> -      "\u3040"-"\u318f",
> -      "\u3300"-"\u337f",
> -      "\u3400"-"\u3d2d",
> -      "\u4e00"-"\u9fff",
> -      "\uf900"-"\ufaff"
> -    ]
> -  >
>
> +| <#OTHER_CJK: //japanese and korean characters
> +    [
> +      "\u3040"-"\u318f",
> +      "\u3300"-"\u337f",
> +      "\u3400"-"\u3d2d",
> +      "\uf900"-"\ufaff"
> +    ]
> +  >
> +| <#CHINESE: //chinese characters
> +    [
> +      "\u4e00"-"\u9fff"
> +    ]
> +  >
>
> -| <SIGRAM: <CJK> >
>
> +| <SIGRAM: <OTHER_CJK> >
> +| <CNWORD: (<CHINESE>)+ > //chinese words
>
> - ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM>)
> + ( token=<WORD> | token=<ACRONYM> | token=<SIGRAM> | token=<CNWORD>)
>
> I will segment chinese characters intelligently but japanese
> and korean characters remains single-gram segmentation.

If there are any JK developers here, we would be very glad :)

> 2.modify NutchDocumentTokenizer.java
>
> -case EOF: case WORD: case ACRONYM: case SIGRAM:
> +case EOF: case WORD: case ACRONYM: case SIGRAM: case CNWORD:
>
> 3.modify FastCharStream.java
> I use ICTCLASC to perform Chinese word segmentation. ICTCLASC don't
> just simply perform bi-gram segmentation but using an approach based on
> multi-layer HMM. Its segmentation precision is 97.58%
> ICTCLASC is free for researchers. see:
> http://www.nlp.org.cn/project/project.php?proj_id=6

Cool, and I should learn more about it.

> 4.modify Summarizer.java
> If Chinese word segmentation could be done in NutchAnalysis.jj
> before tokenizer, then we don't need reset tokens' offset in
> Summarizer.java and everything will be perfect.

True. You can see this in the NutchAnalysisTokenManager.jjFillToken() method.

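A small illustration of the offset problem in point 4, under a loud assumption: suppose the modified character stream hands the tokenizer a copy of the text with a single space inserted between segmented words. The thread does not show the actual FastCharStream change, so this is only a guess at the mechanism. Token offsets computed on such a stream no longer index the original text, and something like the hypothetical SegmentedOffsets class below (not part of Nutch) would be needed to map them back.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Nutch: maps offsets in space-joined text back to the original text. */
class SegmentedOffsets {
    final String segmented;   // the words joined by single spaces
    final int[] toOriginal;   // toOriginal[i] = original offset for segmented offset i

    /** words are assumed to concatenate exactly to the original, unsegmented text. */
    SegmentedOffsets(String[] words) {
        int total = 0;
        for (String w : words) total += w.length();
        int[] map = new int[total + words.length + 1]; // upper bound on segmented length + 1
        StringBuilder sb = new StringBuilder();
        int orig = 0; // current offset in the original text
        for (int k = 0; k < words.length; k++) {
            if (k > 0) {
                // an inserted space has no original character; point it at the next word
                map[sb.length()] = orig;
                sb.append(' ');
            }
            for (int j = 0; j < words[k].length(); j++) {
                map[sb.length()] = orig++;
                sb.append(words[k].charAt(j));
            }
        }
        map[sb.length()] = orig; // end-of-text offset
        segmented = sb.toString();
        toOriginal = Arrays.copyOf(map, sb.length() + 1);
    }

    int originalOffset(int segmentedOffset) {
        return toOriginal[segmentedOffset];
    }

    public static void main(String[] args) {
        SegmentedOffsets m = new SegmentedOffsets(
                new String[] {"Nutch", "is", "a", "search", "engine"});
        // "search" spans [11, 17) in the segmented text and [8, 14) in the original
        System.out.println(m.segmented);
        System.out.println(m.originalOffset(11) + ".." + m.originalOffset(17)); // prints 8..14
    }
}
```

If segmentation happened inside the grammar itself, no such mapping would be needed, which is the point made in the quoted text.
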
> But it seems too difficult to perform intelligent Chinese word
> segmentation in NutchAnalysis.jj. Even impossible??

In fact, the Chinese segmentation problem is equivalent to this question:
given an English sentence S = "Nutchisasearchengine", how can we recover
(or guess) the result R = "Nutch is a search engine" as well as possible?

> Any suggestions?
>
> Best regards
>
> Cao Yuzhong
> 2005-04-12

Regards
/Jack

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
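P.S. To make the "Nutchisasearchengine" question concrete, here is a minimal sketch of one simple way to attack it: a dictionary lookup combined with dynamic programming that picks the split with the fewest words. This is not the multi-layer HMM approach described for ICTCLASC, and the class name DictionarySegmenter and the tiny word list are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical example, not Nutch code: fewest-words dictionary segmentation. */
class DictionarySegmenter {

    /** Returns the split of text into the fewest dictionary words, or an empty list if none exists. */
    static List<String> segment(String text, Set<String> dict) {
        int n = text.length();
        int[] best = new int[n + 1]; // best[i] = fewest words covering text[0, i)
        int[] prev = new int[n + 1]; // prev[i] = start of the last word in that split
        Arrays.fill(best, Integer.MAX_VALUE);
        best[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                // extend a reachable prefix text[0, j) with the word text[j, i)
                if (best[j] != Integer.MAX_VALUE
                        && best[j] + 1 < best[i]
                        && dict.contains(text.substring(j, i))) {
                    best[i] = best[j] + 1;
                    prev[i] = j;
                }
            }
        }
        List<String> words = new ArrayList<>();
        if (best[n] == Integer.MAX_VALUE) {
            return words; // no full segmentation with this dictionary
        }
        // walk the back-pointers from the end, then restore reading order
        for (int i = n; i > 0; i = prev[i]) {
            words.add(text.substring(prev[i], i));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "nutch", "is", "a", "as", "search", "sea", "engine", "in"));
        System.out.println(segment("nutchisasearchengine", dict));
        // prints [nutch, is, a, search, engine]
    }
}
```

Entries such as "as" and "sea" show why a greedy guess is not enough: they match locally but lead to dead ends, so the search has to consider the whole sentence. A real segmenter would also need word statistics to choose among several valid splits.
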
