Hi Jack,

May I know what kind of segmentation you use for your CJK project? Did you add your own bi-gram segmentation for CJK?
I noticed the project Doug did for creativecommons.org using Nutch. I tested that website's search function and found that even Chinese searches return quite decent results. Can Doug share with us whether any special handling is included for those CJK-related results, or did you just use the default NutchDocumentTokenizer for creativecommons.org as well?

Thanks to all for reading.

Guoqiao

-----Original Message-----
From: Jack Tang [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 02, 2005 2:18 PM
To: [EMAIL PROTECTED]; Transbuerg Tian
Subject: Re: Can I build CJK application based no Nutch?

Sorry, I was wrong. It is still broken in svn.
I tried to merge bi-gram segmentation into NutchAnalysis.jj. It seems hard and will take a lot of time. Can someone working on the CJK thread give me some advice?

/Jack

On 6/2/05, Jack Tang <[EMAIL PROTECTED]> wrote:
> Hi Tian & Wu
>
> I suppose Nutch now supports CJK bi-gram segmentation.
>
> /Jack
>
> On 5/25/05, Transbuerg Tian <[EMAIL PROTECTED]> wrote:
> > Hi, wufuheng,
> >
> > First:
> > if you are using Lucene or Nutch for indexing Chinese content, I
> > recommend WebLucene; you can get more info at:
> > http://www.chedong.com .
> > Second:
> > CJK sentence splitting is quite different. For Chinese, the
> > best-known tool is ICTCLAS, which you can search for on Google.
> >
> > I also wrote a Chinese sentence splitter, in both Java and C#.
> >
> > You can get it at: http://www.domolo.com/tec/index.htm
> > or write to: [EMAIL PROTECTED]
> >
> > Hope this helps.
> >
> > transbuerg tian
> > beijing, china
> > http://www.domolo.com
> >
> > 2005/5/24, wu fuheng <[EMAIL PROTECTED]>:
> > >
> > > Dear all,
> > > I think Nutch is a good wrapper for Lucene and comes with a good
> > > crawler. Now I want to build a Chinese/Japanese/Korean language
> > > search application. Should I start from Lucene or Nutch? How does
> > > Nutch support CJK applications?
> > > Sincerely yours,
> > > Simon
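
For readers unfamiliar with the bi-gram segmentation discussed in this thread, here is a minimal sketch of the idea in plain Java. It is not the Nutch or Lucene implementation and does not touch NutchAnalysis.jj; the class name CjkBigramSketch and its methods are hypothetical, and the choice of which Unicode scripts count as CJK is an assumption. Each run of CJK characters is emitted as overlapping two-character tokens, while other letters and digits are grouped into lowercased words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or Lucene.
class CjkBigramSketch {

    // Assumption: Han, Hiragana, Katakana and Hangul code points count as CJK.
    static boolean isCjk(int cp) {
        Character.UnicodeScript s = Character.UnicodeScript.of(cp);
        return s == Character.UnicodeScript.HAN
            || s == Character.UnicodeScript.HIRAGANA
            || s == Character.UnicodeScript.KATAKANA
            || s == Character.UnicodeScript.HANGUL;
    }

    // Split text into overlapping CJK bi-grams plus lowercased non-CJK words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int[] cps = text.codePoints().toArray();
        int i = 0;
        while (i < cps.length) {
            if (isCjk(cps[i])) {
                // Collect one run of consecutive CJK code points.
                int start = i;
                while (i < cps.length && isCjk(cps[i])) {
                    i++;
                }
                if (i - start == 1) {
                    // A lone CJK character becomes a single-character token.
                    tokens.add(new String(cps, start, 1));
                } else {
                    // Emit every adjacent pair in the run.
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(new String(cps, j, 2));
                    }
                }
            } else if (Character.isLetterOrDigit(cps[i])) {
                // Collect one run of non-CJK letters and digits as a word.
                int start = i;
                while (i < cps.length && !isCjk(cps[i])
                        && Character.isLetterOrDigit(cps[i])) {
                    i++;
                }
                tokens.add(new String(cps, start, i - start).toLowerCase(Locale.ROOT));
            } else {
                // Whitespace and punctuation only separate tokens.
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [nutch, 中文, 文搜, 搜索]
        System.out.println(tokenize("Nutch 中文搜索"));
    }
}
```

Because every adjacent pair is indexed, a query segmented the same way can match without a dictionary, at the cost of a larger index and some false matches across word boundaries. Dictionary-based segmenters such as the ICTCLAS tool mentioned above take the opposite trade-off.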
