I'm sorry, the Thai analyzer is in the Lucene package, like so: org.apache.lucene.analysis.th.ThaiAnalyzer
So I'm sure it didn't pick up the language identifier.

What should I do now? Rename the package to nutch?

Can someone please help me? Thanks, and much appreciated again.

cheers,
sanjeev.


sanjeev wrote:
>
> Thank you Mr. Teruhiko Kurosaka,
>
> I was able to locate the th.ngp file in the nutch-0.8-dev distribution.
>
> I was able to compile the distribution. When I ran the crawl, I'm not
> sure it picked up the language identifier. I added
>
> <implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer"
>                 class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/>
>
> to languageidentifier/plugin.xml
>
> Then I ran a crawl and got a stupid error. dedup ...
>
> Dedup: adding indexes in: crawlnewpantip14nov2/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
>         at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
>
> Your help much appreciated.
>
>
> Teruhiko Kurosaka wrote:
>>
>> Oh, Thai words are not space delimited?
>> OK, in that case, you'd need to study how ThaiAnalyzer works and
>> then modify the rules in NutchAnalysis.jj (if you are going to use
>> the web search GUI from Nutch). This is because the search
>> expressions are parsed by the parser generated from NutchAnalysis.jj
>> first, before each term is handed to the language-specific analyzer,
>> and currently, if a character belongs to the CJK category, each
>> character is treated as though it were a word. If ThaiAnalyzer does
>> not do the same, you can index the Thai docs but you won't be able to
>> find any doc unless the search term is one Unicode character.
>>
>> -kuro
>>
>>> -----Original Message-----
>>> From: sanjeev [mailto:[EMAIL PROTECTED]
>>> Sent: 2006-11-08 19:28
>>> To: nutch-dev@lucene.apache.org
>>> Subject: Re: implement thai lanaguage analyzer in nutch
>>>
>>>
>>> I need a Thai Analyzer for Nutch. I want the crawler to be
>>> intelligent enough to split Thai words correctly, since Thai
>>> doesn't have spaces between words.
>>> :-(
>>>
>>>
>>> ogjunk-nutch wrote:
>>> >
>>> > Regarding Thai, there is a Thai Analyzer in Lucene already:
>>> >
>>> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>>> > total 24
>>> > drwxrwxr-x 7 otis otis 4096 Oct 27 02:08 .svn/
>>> > -rw-rw-r-- 1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
>>> > -rw-rw-r-- 1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
>>> >
>>> > Otis
>>> >
>>> > ----- Original Message ----
>>> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>>> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
>>> > Sent: Wednesday, November 8, 2006 2:16:38 PM
>>> > Subject: RE: implement thai lanaguage analyzer in nutch
>>> >
>>> > Sanjay,
>>> > I don't think you should follow the Chinese example and extend
>>> > the CJK range. This was needed because Chinese and Japanese don't
>>> > use spaces to separate words. I believe Thai uses spaces, right?
>>> > If so, you should extend the LETTER range to include Thai
>>> > characters rather than CJK.
>>> >
>>> > Another place you would need to change is the LanguageIdentifier.
>>> > You would either train it, or implement some hack, in order for
>>> > it to be able to detect Thai-language documents that are not HTML
>>> > with a lang="th" attribute.
>>> >
>>> > -kuro
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
>>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>>
>>
>

--
View this message in context: http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7335375
Sent from the Nutch - Dev mailing list archive at Nabble.com.
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
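
Editor's note: the mismatch Kuro describes in the quoted thread (the query parser splitting on every character while the index holds whole words) can be illustrated with a small standalone sketch. It uses no Nutch or Lucene classes. The class name, both helper methods, and the indexed terms are hypothetical and chosen only for illustration; the per-character splitting is an assumption based on Kuro's description of how CJK-category characters are handled, not code taken from NutchAnalysis.jj.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Nutch or Lucene.
class QueryIndexMismatch {

    // Query side (assumed behaviour): every character becomes its own term,
    // as Kuro describes for characters in the CJK category.
    static List<String> perCharacterTerms(String query) {
        List<String> terms = new ArrayList<>();
        query.codePoints()
             .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    // A document matches only if every query term exists in the index
    // as a complete term.
    static boolean matches(Set<String> indexedTerms, List<String> queryTerms) {
        return indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index side (hypothetical): terms a word-level analyzer might emit.
        String threeCharWord = "\u0E44\u0E17\u0E22"; // the word for "Thai", three characters
        String oneCharWord = "\u0E13";               // a word written with one character
        Set<String> indexed = new HashSet<>(Arrays.asList(threeCharWord, oneCharWord));

        // The query is split into three single-character terms, and none of
        // them is present in the index as a term of its own.
        System.out.println(matches(indexed, perCharacterTerms(threeCharWord)));

        // A one-character query still lines up with a one-character indexed term.
        System.out.println(matches(indexed, perCharacterTerms(oneCharWord)));
    }
}
```

In this sketch the first check is false and the second is true, which is the "won't be able to find any doc unless the search term is one Unicode character" situation. The query side and the index side have to agree on term boundaries, whichever way that is achieved.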