Thanks a bunch, Shtykh. After reading your tutorial, I understood how to wrap the Lucene ThaiAnalyzer in a Nutch analyzer.
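For anyone following the thread, the wrapping idea can be sketched as plain delegation. This is a self-contained illustration only, not Nutch or Lucene code: `TextAnalyzer`, `WhitespaceLibraryAnalyzer`, `WrappingAnalyzer` and the language-keyed registry are hypothetical stand-ins for the Lucene analyzer, the Nutch wrapper class, and the per-language analyzer lookup discussed here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AnalyzerWrapSketch {

    /** Hypothetical stand-in for a Lucene-style analyzer: text in, tokens out. */
    public interface TextAnalyzer {
        List<String> tokens(String text);
    }

    /** Stand-in for an existing library analyzer that is reused unchanged. */
    public static class WhitespaceLibraryAnalyzer implements TextAnalyzer {
        @Override
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** The wrapper: holds a library analyzer and forwards every call to it. */
    public static class WrappingAnalyzer implements TextAnalyzer {
        private final TextAnalyzer delegate;

        public WrappingAnalyzer(TextAnalyzer delegate) {
            this.delegate = delegate;
        }

        @Override
        public List<String> tokens(String text) {
            return delegate.tokens(text);
        }
    }

    // Analyzers keyed by language code (for example "th"); assumption for this sketch.
    private final Map<String, TextAnalyzer> byLanguage = new HashMap<>();
    private final TextAnalyzer fallback;

    public AnalyzerWrapSketch(TextAnalyzer fallback) {
        this.fallback = fallback;
    }

    public void register(String languageCode, TextAnalyzer analyzer) {
        byLanguage.put(languageCode, analyzer);
    }

    /** Returns the analyzer registered for the identified language, else the fallback. */
    public TextAnalyzer select(String languageCode) {
        TextAnalyzer found = byLanguage.get(languageCode);
        return found != null ? found : fallback;
    }

    public static void main(String[] args) {
        TextAnalyzer library = new WhitespaceLibraryAnalyzer();
        AnalyzerWrapSketch registry = new AnalyzerWrapSketch(library);
        registry.register("th", new WrappingAnalyzer(library));
        System.out.println(registry.select("th").tokens("one two three"));
    }
}
```

The point of the sketch is that the library analyzer is not renamed or recompiled; the wrapper only forwards calls, and the language code decides which analyzer is picked. A real plugin would instead build on the Nutch analyzer extension point and the Lucene ThaiAnalyzer, whose exact signatures are not shown here.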
I now have an analysis-th directory in nutch-0.8.1/plugins with a plugin.xml, and I made the changes in nutch-site.xml. From the hadoop log file it appears that the language identifier has been activated, and Thai (th) appears in the list of supported languages. However, I am unable to open the index using Luke, so I have no way of knowing whether Thai is being indexed correctly. Here are some excerpts from the hadoop log:

2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Registered Plugins:
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Site Query Filter (query-site)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - URL Query Filter (query-url)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Language Identification Parser/Filter (language-identifier)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Registered Extension-Points:
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2549-12-15 11:25:55,638 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2549-12-15 11:25:55,669 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2549-12-15 11:25:55,779 INFO lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2549-12-15 11:25:56,544 INFO lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
2549-12-15 11:25:56,544 INFO indexer.IndexingFilters - Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
2549-12-15 11:25:57,091 INFO indexer.Indexer - Optimizing index.
2549-12-15 11:25:57,544 INFO indexer.Indexer - Indexer: done

///////////////////////////////////////////////////////////////////////////////////////

And this is the crawl log:

Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112523
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112523
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawlxx3/segments/25491215112536
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112536
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with: java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112536
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawlxx3/linkdb
LinkDb: adding segment: crawlxx3/segments/25491215112523
LinkDb: adding segment: crawlxx3/segments/25491215112536
LinkDb: done
Indexer: starting
Indexer: linkdb: crawlxx3/linkdb
Indexer: adding segment: crawlxx3/segments/25491215112523
Indexer: adding segment: crawlxx3/segments/25491215112536
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawlxx3/indexes
Dedup: done
Adding crawlxx3/indexes/part-00000


Shtykh Roman wrote:
>
> Hi,
>
> I have recently dealt with Japanese support and wrote how I did it on
> http://nislab.human.waseda.ac.jp/blog/?page_id=7 . I think it'll give
> you some idea.
>
> Br,
> Roman
>
> --- sanjeev <[EMAIL PROTECTED]> wrote:
>
>>
>> Hi all,
>>
>> I am still waiting for some help re: the Thai language indexing and
>> searching.
>>
>> Please help as I'm quite lost on this one.
>>
>> Thanks and regards,
>> sanjeev.
>>
>>
>> sanjeev wrote:
>> >
>> > Thanks for clearing up some doubts. But exactly how do I wrap it?
>> > Do I need to make changes in code to utilize the new thaitokenizer?
>> > If yes - where are the places that need modification?
>> > Do I need to download a dev version and do a recompile?
>> >
>> > Please - if you could possibly tell me the steps - in brief - I would be
>> > highly obliged.
>> >
>> > Thanks,
>> > sanjeev.
>> >
>> >
>> > Jérôme Charron wrote:
>> >>
>> >>> I used an existing ThaiAnalyzer which was in the lucene package.
>> >>> ok - I renamed the lucene.analysis.th.* to nutch.analysis.th.* -
>> >>> compiled and placed all class files in a jar - analysis-th.jar
>> >>> (do I need to bundle the ngp file in the jar as well?)
>> >>
>> >> 1. You don't have to refactor the lucene analyzer. Just wrap it, like
>> >> I do with the french and german analyzers (they both use some
>> >> analyzers from lucene).
>> >> 2. The analyzer doesn't need ngp files... I think you misunderstood
>> >> something:
>> >> 2.1 On one side there is the language identifier, which uses NGP
>> >> files to identify the language of a document.
>> >> 2.2 On the other side, if a suitable analyzer is found for the
>> >> identified language, it is used to analyze the document.
>> >>
>> >> Regards
>> >>
>> >> Jérôme
>> >>
>> >> --
>> >> http://motrech.free.fr/
>> >> http://www.frutch.org/
>> >>
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7827701
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>

--
View this message in context: http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7886152
Sent from the Nutch - Dev mailing list archive at Nabble.com.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers