Regarding Thai, there is a Thai Analyzer in Lucene already:

$ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
total 24
drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
-rw-rw-r-- 1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
-rw-rw-r-- 1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
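For a rough idea of what Thai word segmentation can look like without any Lucene code, here is a minimal JDK-only sketch. It assumes the JDK's locale-aware java.text.BreakIterator is an acceptable stand-in for whatever ThaiWordFilter does internally (I have not restated that class here, so treat any resemblance as an assumption). The class name ThaiBreakSketch and the method segment are made up for illustration, and the quality of the Thai breaks depends on the locale data shipped with the JDK in use.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or Nutch.
class ThaiBreakSketch {

    // Splits text at the word boundaries reported by the JDK's
    // Thai-locale BreakIterator and drops whitespace-only pieces.
    public static List<String> segment(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(new Locale("th"));
        breaker.setText(text);

        List<String> words = new ArrayList<String>();
        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.length() > 0) {
                words.add(piece);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // "ภาษาไทย" written with escapes so the source file stays ASCII.
        String thai = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(segment(thai));
        System.out.println(segment("hello world"));
    }
}
```

Punctuation comes back as its own piece here; a real analyzer would filter those out as well.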
Otis

----- Original Message ----
From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
To: sanjeev <[EMAIL PROTECTED]>; [email protected]
Sent: Wednesday, November 8, 2006 2:16:38 PM
Subject: RE: implement thai lanaguage analyzer in nutch

Sanjay,

I don't think you should follow the Chinese example and extend the CJK range. That was needed because Chinese and Japanese don't use spaces to separate words. I believe Thai uses spaces, right? If so, you should extend the LETTER range to include Thai characters rather than extending CJK.

Another place you would need to change is the LanguageIdentifier. You would either train it or implement some hack so that it can detect Thai-language documents that are not HTML with a lang="th" attribute.

-kuro

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
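
The "hack" for detecting Thai documents mentioned in the quoted message could be as small as a script check. The sketch below is only an illustration under stated assumptions: it does not touch Nutch's LanguageIdentifier, the class name ThaiScriptHeuristic and both method names are hypothetical, and the threshold is an arbitrary choice. It counts how many letters fall in the Thai Unicode block (U+0E00 to U+0E7F) and calls the text Thai when that share is high enough.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of Nutch.
class ThaiScriptHeuristic {

    // Fraction of the letters in text that belong to the Thai Unicode block.
    // Digits, punctuation, whitespace and combining marks are ignored.
    public static double thaiLetterRatio(String text) {
        int letters = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue;
            }
            letters++;
            if (UnicodeBlock.of(cp) == UnicodeBlock.THAI) {
                thai++;
            }
        }
        return letters == 0 ? 0.0 : (double) thai / letters;
    }

    // threshold is an assumption to tune against real pages, e.g. 0.5.
    public static boolean looksThai(String text, double threshold) {
        return thaiLetterRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        // "ภาษาไทย search" written with escapes so the source file stays ASCII.
        String sample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22 search";
        System.out.println(thaiLetterRatio(sample));
        System.out.println(looksThai(sample, 0.5));
    }
}
```

A check like this only identifies the script, which happens to be enough for Thai; a trained language profile would still be the more general route.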
