Hello cao,

I tried the Chinese-specific tokenizer today. The results were odd, and I could not get any query results in Nutch either. So I think there may be some differences between Nutch's queries and Luke's queries. Can anyone explain?
Thanks,
/Jack

======= At 2005-03-17, 13:49:00 you wrote: =======
>I have added Chinese stopwords to String[] STOP_WORDS in NutchAnalysis.jj.
>My problem is that Nutch returns nothing when I use any Chinese keywords,
>even though I can find these Chinese keywords in the index files (using
>Luke).
>
>>From: "Jason Tang" <[EMAIL PROTECTED]>
>>Reply-To: [email protected]
>>To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
>>Date: Thu, 17 Mar 2005 11:08:15 +0800
>>
>>Hi cao,
>>
>>I think the character "的" is a stopword among Chinese characters.
>>I think NutchAnalysis.jj should load a different stopwords file when the
>>language is different.
>>
>>/Jack
>>
>>======= At 2005-03-17, 10:27:40 you wrote: =======
>>
>> >No answer for this?
>> >Any tips are appreciated.
>> >
>> >>From: "cao yuzhong" <[EMAIL PROTECTED]>
>> >>Reply-To: [email protected]
>> >>To: [email protected]
>> >>CC: [EMAIL PROTECTED]
>> >>Subject: A problem about Chinese word segment
>> >>Date: Tue, 15 Mar 2005 05:16:30 +0000
>> >>
>> >>Hi, all,
>> >>
>> >>Currently, Nutch-0.6 simply treats each Chinese character as a single
>> >>token. I have attempted to make it treat groups of related Chinese
>> >>characters (i.e. Chinese words) as single tokens,
>> >>so I needed to modify the Analyzer.
>> >>
>> >>First, I modified the file NutchAnalysis.jj in
>> >>src/java/net/nutch/analysis.
>> >>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>> >>Nutch can treat one or more Chinese characters as a token. Then I
>> >>used JavaCC to generate the code.
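The effect of that grammar change can be sketched in plain Java. This is a simplified stand-in for the JavaCC-generated tokenizer, not the actual output: it only approximates the `<CJK>` character class with the basic CJK Unified Ideographs range, and it ignores all non-CJK token rules.

```java
import java.util.ArrayList;
import java.util.List;

public class SigramDemo {
    // Rough stand-in for the <CJK> character class in NutchAnalysis.jj
    // (basic CJK Unified Ideographs block only).
    static boolean isCjk(char c) {
        return c >= '\u4E00' && c <= '\u9FFF';
    }

    // Old rule, <SIGRAM: <CJK>>: every CJK character is its own token.
    static List<String> tokenizeSingle(String text) {
        List<String> tokens = new ArrayList<>();
        for (char c : text.toCharArray()) {
            if (isCjk(c)) tokens.add(String.valueOf(c));
        }
        return tokens;
    }

    // New rule, <SIGRAM: (<CJK>)+>: each maximal run of CJK characters
    // becomes one token, so words pre-segmented with spaces survive intact.
    static List<String> tokenizeRuns(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (isCjk(text.charAt(i))) {
                int start = i;
                while (i < text.length() && isCjk(text.charAt(i))) i++;
                tokens.add(text.substring(start, i));
            } else {
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String seg = "中文 搜索引擎";            // already space-segmented
        System.out.println(tokenizeSingle(seg)); // [中, 文, 搜, 索, 引, 擎]
        System.out.println(tokenizeRuns(seg));   // [中文, 搜索引擎]
    }
}
```

This is why the segmentation step below matters: `(<CJK>)+` only produces word-level tokens if spaces have already been inserted between words before the text reaches the tokenizer.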
>> >>
>> >>Second, I have to segment Chinese text into Chinese words (insert
>> >>spaces between Chinese words) before indexing so that Nutch can
>> >>recognize them. I have written a class to do this, and I have
>> >>modified the function refill() in FastCharStream.java:
>> >>
>> >>Below the line:
>> >>int charsRead = input.read(buffer, newPosition,
>> >>                           buffer.length - newPosition);
>> >>
>> >>I added:
>> >>//----
>> >>if (charsRead != -1) {
>> >>
>> >>  String str = new String(buffer, newPosition, charsRead);
>> >>
>> >>  // do Chinese word segmentation; for example,
>> >>  // if str = "中文搜索引擎的分词问题"
>> >>  // then str2 will be "中文 搜索引擎 的 分词 问题"
>> >>  String str2 = Spliter.segSentence(str);
>> >>
>> >>  while (str2.length() > buffer.length - newPosition) { // expand the buffer
>> >>    char[] newBuffer = new char[buffer.length * 2];
>> >>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>> >>    buffer = newBuffer;
>> >>  }
>> >>
>> >>  for (int i = 0; i < str2.length(); i++) {
>> >>    buffer[newPosition + i] = str2.charAt(i);
>> >>  }
>> >>  charsRead = str2.length();
>> >>}
>> >>//----
>> >>
>> >>Third, compiling..., running CrawlTool....
>> >>Then I used lukeall-0.5 to view the index directory.
>> >>It's OK: not single Chinese characters but Chinese words have been
>> >>organized as terms.
>> >>
>> >>But when I deploy Nutch in Tomcat 5.5 and do a search test,
>> >>it can't find anything. What's wrong?
>> >>
>> >>I need your hints, or you may recommend some articles about this.
>> >>
>> >>Best regards,
>> >>
>> >>Cao Yuzhong
>> >>2005-03-15
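The splice added to refill() can be exercised on its own as a sketch. Spliter.segSentence() is the poster's own class and its implementation is not shown; the toy segmenter below simply inserts a space after every two characters so the sketch runs standalone — a real segmenter would consult a dictionary. Note that because the grown buffer is a new array, refill() must keep using the returned reference and must treat str2.length() as the new charsRead.

```java
import java.util.Arrays;

public class RefillSketch {
    // Hypothetical stand-in for Spliter.segSentence(): splits after every
    // two characters purely so the sketch is runnable. A real segmenter
    // would look words up in a dictionary.
    static String segSentence(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0 && i % 2 == 0) sb.append(' ');
            sb.append(s.charAt(i));
        }
        return sb.toString();
    }

    // The refill() splice as a standalone step: re-segment the freshly read
    // characters and write the space-padded text back into the buffer,
    // doubling the buffer until the longer text fits. Returns the (possibly
    // replaced) buffer; the caller's new charsRead is segSentence(str).length().
    static char[] splice(char[] buffer, int newPosition, int charsRead) {
        if (charsRead == -1) return buffer; // end of stream: nothing to do
        String str = new String(buffer, newPosition, charsRead);
        String str2 = segSentence(str);
        while (str2.length() > buffer.length - newPosition) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2); // expand
        }
        str2.getChars(0, str2.length(), buffer, newPosition);
        return buffer;
    }

    public static void main(String[] args) {
        char[] buf = "中文搜索引擎".toCharArray(); // exactly full: length 6
        char[] out = splice(buf, 0, 6);
        // The padded text is 8 chars, so the buffer doubled to 12.
        System.out.println(new String(out, 0, 8)); // 中文 搜索 引擎
    }
}
```

The expansion loop is the part that is easy to get wrong in refill() itself: inserting spaces makes the text longer than what was read, so without the doubling step the write past buffer.length would throw ArrayIndexOutOfBoundsException.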
= = = = = = = = = = = = = = = = = = = =

Jason Tang
[EMAIL PROTECTED]
2005-04-01

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
