Thank you Mr. Teruhiko Kurosaka,
I was able to locate the th.ngp file in the nutch-0.8-dev distribution, and I was able to compile the distribution. When I ran the crawl, though, I'm not sure it picked up the language identifier.

I added

  <implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer" class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/>

to languageidentifier/plugin.xml. Then I ran a crawl and it failed at the dedup step with the following error:

dedup ...
Dedup: adding indexes in: crawlnewpantip14nov2/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)

Your help is much appreciated.


Teruhiko Kurosaka wrote:
>
> Oh, Thai words are not space delimited?
> OK, in that case, you'd need to study how ThaiAnalyzer works and
> then modify the rules in NutchAnalysis.jj (if you are going to use
> the web search GUI from Nutch). This is because the search
> expressions are parsed by the parser generated from NutchAnalysis.jj
> first before each term is handed to the language specific analyzer,
> and currently if a character belongs to the CJK category, each character
> is treated as though it were a word. If ThaiAnalyzer does not do the
> same, you can index the Thai docs but you won't be able to find any doc
> unless the search term is one Unicode character.
>
> -kuro
>
>> -----Original Message-----
>> From: sanjeev [mailto:[EMAIL PROTECTED]
>> Sent: 2006-11-08 19:28
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: implement thai lanaguage analyzer in nutch
>>
>> I need a Thai Analyzer for Nutch. I want the crawler to be intelligent enough
>> to split thai words correctly since thai don't have spaces between words.
>> :-(
>>
>> ogjunk-nutch wrote:
>> >
>> > Regarding Thai, there is a Thai Analyzer in Lucene already:
>> >
>> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>> > total 24
>> > drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
>> > -rw-rw-r-- 1 otis otis 1528 Jun 5 14:27 ThaiAnalyzer.java
>> > -rw-rw-r-- 1 otis otis 2437 Jun 5 14:27 ThaiWordFilter.java
>> >
>> > Otis
>> >
>> > ----- Original Message ----
>> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
>> > Sent: Wednesday, November 8, 2006 2:16:38 PM
>> > Subject: RE: implement thai lanaguage analyzer in nutch
>> >
>> > Sanjay,
>> > I don't think you should follow the Chinese example and extend the CJK
>> > range. This was needed because Chinese and Japanese don't use space to
>> > separate words. I believe Thai uses spaces, right? If so, you should
>> > extend LETTER range to include Thai character rather than CJK.
>> >
>> > Another place you would need to change is the LanguageIdentifier.
>> > You would either train it, or implement some hack, in order for it to
>> > be able to detect Thai language documents that are not of HTML with
>> > lang="th" attribute.
>> >
>> > -kuro
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>
>

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7334391
Sent from the Nutch - Dev mailing list archive at Nabble.com.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
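
For reference, here is a minimal sketch of the mismatch Kuro describes in the quoted reply: if the query side emits one term per character (as the CJK rule in NutchAnalysis.jj is said to do) while the index side emits whole words, a multi-character query term has nothing to match. The sketch uses the JDK's java.text.BreakIterator as a stand-in for ThaiAnalyzer, which is an assumption. The class name ThaiTokenMismatch and both helper methods are hypothetical, and how well Thai text is segmented depends on the break-iterator data shipped with the JRE.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiTokenMismatch {

    // Query side (assumed behaviour): one term per Unicode code point,
    // mimicking the per-character CJK rule described in the thread.
    static List<String> perCharacter(String text) {
        List<String> terms = new ArrayList<>();
        text.codePoints().forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    // Index side (stand-in for a word-level analyzer): word segments as
    // reported by the JDK BreakIterator for the given locale.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> terms = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Keep only segments that begin with a letter or digit,
            // which drops whitespace and punctuation segments.
            if (Character.isLetterOrDigit(piece.codePointAt(0))) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "ภาษาไทย"; // sample Thai text, no spaces between words
        List<String> indexTerms = words(text, Locale.forLanguageTag("th"));
        List<String> queryTerms = perCharacter(text);

        System.out.println("index terms: " + indexTerms);
        System.out.println("query terms: " + queryTerms);

        // A query term only finds a document if it equals an indexed term.
        for (String q : queryTerms) {
            System.out.println(q + " matches an indexed term: " + indexTerms.contains(q));
        }
    }
}
```

If the two sides disagree like this, the fix is to make them agree: either both segment into words or both split per character.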