Thanks Jerome, I used an existing ThaiAnalyzer which was in the Lucene package.
OK, I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled it, and placed all the class files in a jar, analysis-th.jar. (Do I need to bundle the NGP file in the jar as well?)

Take a look at the log file for a sample crawl. I have the feeling that the language-identifier is still not activated.

I need your help urgently in resolving this issue.

Cheers, regards, and thanks for all your help.

Sanjeev

491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-default.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/crawl-tool.xml
491116 151804 parsing file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-site.xml
491116 151804 No FS indicated, using default:local
491116 151804 crawl started in: crawlnewxx2
491116 151804 rootUrlFile = urls
491116 151804 threads = 10
491116 151804 depth = 10
491116 151804 Created webdb at LocalFS,C:\cygwin\home\robert\nutch-0.7.2\crawlnewxx2\db
491116 151804 Starting URL processing
491116 151804 Plugins: looking in: C:\cygwin\home\robert\nutch-0.7.2\plugins
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\clustering-carrot2
491116 151804 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\creativecommons
491116 151804 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\index-more
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\ontology
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-ext
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-js
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-msword
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-pdf
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-rss
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-text\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-file
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-ftp
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-http\plugin.xml
491116 151805 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-httpclient
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-more
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-site\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\query-url\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
491116 151805 not including: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-prefix
491116 151805 parsing: C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
491116 151805 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
491116 151805 found resource regex-urlfilter.txt at file:/C:/cygwin/home/robert/nutch-0.7.2/conf/regex-urlfilter.txt
491116 151805 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer

Jérôme Charron wrote:
>
>> OK. I was able to enable the language identifier plugin by adding the
>> value to the plugin.includes attribute in nutch-site.xml, but I'm not
>> sure that just by doing that I can have Thai text recognized and
>> tokenized properly.
>> What else do I have to do? Please help me.
>
> 1. You must create a Thai NGP (Ngram Profile file) so that the language
> identifier can identify Thai!
> 2. You must create a Thai analyzer (see for instance the analysis-fr and
> analysis-de sample analyzers).
>
> Best Regards
>
> Jérôme
>

--
View this message in context: http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7375925
Sent from the Nutch - Dev mailing list archive at Nabble.com.
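
A note on the `not including: ...\plugins\language-identifier` line in the log above: as far as I understand it, Nutch decides which plugins to load by matching each plugin id against the regular expression in plugin.includes. The sketch below is not Nutch code; it only imitates that check with plain java.util.regex so a candidate value can be tried out before a crawl. The default expression is an assumption inferred from the plugins the log reports as "parsing:", the plugin id "analysis-th" is hypothetical (taken from the jar name), and full-match semantics are assumed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesCheck {

    // Assumed value: inferred from the plugins the crawl log reports as "parsing:".
    static final String DEFAULT_INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)";

    // Hypothetical extended value; "analysis-th" is an assumed plugin id.
    static final String EXTENDED_INCLUDES =
        DEFAULT_INCLUDES + "|language-identifier|analysis-th";

    // Returns the plugin ids that the given expression matches in full.
    static List<String> included(String includesRegex, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> result = new ArrayList<>();
        for (String id : pluginIds) {
            if (pattern.matcher(id).matches()) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList(
            "index-basic", "language-identifier", "analysis-th", "parse-pdf");
        System.out.println("default:  " + included(DEFAULT_INCLUDES, ids));
        System.out.println("extended: " + included(EXTENDED_INCLUDES, ids));
    }
}
```

With the assumed default expression only index-basic is matched; with the extended one, language-identifier and analysis-th are matched as well. If the crawl log still prints "not including" for language-identifier after nutch-site.xml has been edited, it may be worth checking whether that copy of nutch-site.xml is the one the crawl actually reads.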