OK. I downloaded the LuceneInAction code examples from the book and found that they include some analyzers and tests/demos that cover Chinese.
But these analyzers are standalone Java programs with a main method. My question is: how do I integrate one into Nutch so that the index created by the crawl process is searchable in Thai?

Someone please help, as I'm hopelessly confused by the whole thing. :-(

cheers,
sanjeev.

ogjunk-nutch wrote:
>
> Regarding Thai, there is a Thai Analyzer in Lucene already:
>
> $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> total 24
> drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
> -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
> -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java
>
> Otis
>
> ----- Original Message ----
> From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
> Sent: Wednesday, November 8, 2006 2:16:38 PM
> Subject: RE: implement thai lanaguage analyzer in nutch
>
> Sanjay,
> I don't think you should follow the Chinese example and extend the CJK
> range. This was needed because Chinese and Japanese don't use spaces to
> separate words. I believe Thai uses spaces, right? If so, you should
> extend the LETTER range to include Thai characters rather than CJK.
>
> Another place you would need to change is the LanguageIdentifier.
> You would either train it, or implement some hack, in order for it to
> be able to detect Thai-language documents that are not HTML with a
> lang="th" attribute.
>
> -kuro

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252838
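
For reference, the "hack" kuro mentions for spotting Thai documents could look roughly like the sketch below. It only uses the JDK: Thai characters live in the Unicode block U+0E00..U+0E7F, which Java exposes as Character.UnicodeBlock.THAI. The class name ThaiDetector, its methods, and the 0.5 threshold are all made up for illustration. This is not Nutch's LanguageIdentifier API, and it does not show how to wire anything into the Nutch plugin system or the analysis grammar.

```java
// Hypothetical helper for illustration only; not part of Nutch or Lucene.
final class ThaiDetector {

    private ThaiDetector() {}

    /** True if the code point lies in the Unicode Thai block (U+0E00..U+0E7F). */
    static boolean isThai(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.THAI;
    }

    /** Fraction of the letters in text that are Thai; 0.0 if text has no letters. */
    static double thaiRatio(String text) {
        int letters = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, whitespace, punctuation, combining marks
            }
            letters++;
            if (isThai(cp)) {
                thai++;
            }
        }
        return letters == 0 ? 0.0 : (double) thai / letters;
    }

    /** Assumed rule of thumb: call the text Thai if enough of its letters are Thai. */
    static boolean isMostlyThai(String text, double threshold) {
        return thaiRatio(text) >= threshold;
    }

    public static void main(String[] args) {
        // The escapes spell the Thai word for "Thai language".
        String thaiText = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println(isMostlyThai(thaiText, 0.5));             // prints true
        System.out.println(isMostlyThai("plain English text", 0.5)); // prints false
    }
}
```

The same isThai check is the kind of test a tokenizer's LETTER range would need to accept for Thai characters to count as word characters. Whether a character-ratio test is good enough for real crawled pages (mixed-language pages, short snippets) is an open question, so treat the threshold as something to tune.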