OK Kuro, you are wrong about the Thai language having spaces between words. Thai doesn't put spaces between words, and segmenting Thai is a bit tricky, methinks.
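To make the segmentation point concrete, here is a minimal sketch that splits unspaced Thai text with the JDK's own `java.text.BreakIterator`. It is a standalone illustration, not Nutch or Lucene code: the class name `ThaiSegmentationSketch` and the `segment` helper are made up for this example. It also assumes the running JDK ships Thai locale data (on modular JDKs that may mean the `jdk.localedata` module is present); without it the iterator may hand back a whole run of Thai as a single token. As far as I know the ThaiWordFilter mentioned in the quoted reply takes a similar approach, but check its source before relying on that.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of Nutch or Lucene.
public class ThaiSegmentationSketch {

    // Splits text at the word boundaries reported by the JDK for the Thai locale.
    // Segments made up only of whitespace or punctuation are dropped.
    static List<String> segment(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            if (containsLetterOrDigit(candidate)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    // True if at least one code point in s is a letter or a digit.
    private static boolean containsLetterOrDigit(String s) {
        return s.codePoints().anyMatch(Character::isLetterOrDigit);
    }

    public static void main(String[] args) {
        // "Thai has no spaces between words", written without spaces.
        for (String token : segment("ภาษาไทยไม่มีช่องว่างระหว่างคำ")) {
            System.out.println(token);
        }
    }
}
```

Where exactly the boundaries fall depends on the dictionary bundled with the JDK, so treat the printed tokens as a rough guide rather than a linguistically verified segmentation.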
Will appreciate any/all help you can give me.

cheers,
sanjeev

sanjeev wrote:
> ok. I downloaded the LuceneInAction code examples from the book and found
> there were some analyzers and tests/demos which included chinese.
>
> But these analyzers were standalone java programs with a main method.
>
> My question is how to integrate into nutch so the index created by crawl
> process can be searchable in thai ?
>
> Someone please help as I'm hopelessly confused by the whole thing. :-(
>
> cheers,
> sanjeev.
>
> ogjunk-nutch wrote:
>> Regarding Thai, there is a Thai Analyzer in Lucene already:
>>
>> $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>> total 24
>> drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
>> -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
>> -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java
>>
>> Otis
>>
>> ----- Original Message ----
>> From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>> To: sanjeev <[EMAIL PROTECTED]>; [email protected]
>> Sent: Wednesday, November 8, 2006 2:16:38 PM
>> Subject: RE: implement thai lanaguage analyzer in nutch
>>
>> Sanjay,
>> I don't think you should follow the Chinese example and extend the CJK
>> range. This was needed because Chinese and Japanese don't use space to
>> separate words. I believe Thai uses spaces, right? If so, you should
>> extend LETTER range to include Thai character rather than CJK.
>>
>> Another place you would need to change is the LanguageIdentifier.
>> You would either train it, or implement some hack, in order for it to
>> be able to detect Thai language documents that are not of HTML with
>> lang="th" attribute.
>>
>> -kuro

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252863
Sent from the Nutch - Dev mailing list archive at Nabble.com.
