Re: [Nutch-dev] implement thai language indexing and search

sanjeev Thu, 14 Dec 2006 20:38:09 -0800

Thanks a bunch, Shtykh.

After reading your tutorial, I understood how to wrap a Nutch analyzer around
the Lucene ThaiAnalyzer.
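
For readers following the thread, the wrapping idea can be sketched roughly as below. This is only an illustration of the delegation pattern, not the actual plugin code: `SimpleAnalyzer` and `DelegateAnalyzer` are hypothetical stand-ins defined here so the snippet compiles on its own, and the whitespace tokenizer is a placeholder, not how Thai text is really segmented. In a real plugin the wrapper would presumably extend Nutch's analyzer base class and forward to Lucene's ThaiAnalyzer instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WrapSketch {
    /** Hypothetical stand-in for an analyzer interface (not the Lucene API). */
    interface SimpleAnalyzer {
        List<String> tokens(String fieldName, String text);
    }

    /** Hypothetical stand-in for the existing analyzer being wrapped. */
    static class DelegateAnalyzer implements SimpleAnalyzer {
        public List<String> tokens(String fieldName, String text) {
            List<String> out = new ArrayList<>();
            // Placeholder logic: split on whitespace and lowercase each token.
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    out.add(t.toLowerCase(Locale.ROOT));
                }
            }
            return out;
        }
    }

    /** The wrapper has no analysis logic of its own; it only forwards calls. */
    static class ThaiAnalyzerWrapper implements SimpleAnalyzer {
        private static final SimpleAnalyzer DELEGATE = new DelegateAnalyzer();

        public List<String> tokens(String fieldName, String text) {
            return DELEGATE.tokens(fieldName, text);
        }
    }

    public static void main(String[] args) {
        System.out.println(new ThaiAnalyzerWrapper().tokens("content", "Hello Nutch"));
    }
}
```

The point of the pattern is that the existing analyzer stays untouched in its original package; only the thin wrapper lives in the plugin.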


I now have an analysis-th directory in nutch-0.8.1/plugins with a plugin.xml,
and I made the changes in nutch-site.xml and all.

From the Hadoop log file it appears that the language identifier has been
activated, and Thai appears among the list of supported languages.
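
Identification is only half of the flow Jérôme describes in the quoted text further down: once a language is identified, a suitable analyzer still has to be found for it. A minimal sketch of that lookup, assuming a simple map from language code to analyzer with a default fallback, is given here. `AnalyzerLookupSketch`, `register` and `analyzerFor` are hypothetical names, and the strings stand in for real analyzer objects; this is not Nutch's actual lookup code.

```java
import java.util.HashMap;
import java.util.Map;

class AnalyzerLookupSketch {
    // Hypothetical registry: language code (e.g. "th") -> analyzer name.
    private final Map<String, String> byLanguage = new HashMap<>();
    private final String defaultAnalyzer;

    AnalyzerLookupSketch(String defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void register(String lang, String analyzerName) {
        byLanguage.put(lang, analyzerName);
    }

    /** Returns the analyzer registered for lang, or the default if none is. */
    String analyzerFor(String lang) {
        if (lang == null) {
            return defaultAnalyzer; // language could not be identified
        }
        return byLanguage.getOrDefault(lang, defaultAnalyzer);
    }

    public static void main(String[] args) {
        AnalyzerLookupSketch lookup = new AnalyzerLookupSketch("default");
        lookup.register("th", "analysis-th");
        System.out.println(lookup.analyzerFor("th"));
        System.out.println(lookup.analyzerFor("ja"));
    }
}
```

Under this assumption, a document identified as Thai would silently fall back to the default analyzer whenever no analyzer is registered for "th", which is why "th" showing up in the identifier's list does not by itself show that the Thai analyzer is in use.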

However, I am unable to open the index using Luke, so I have no way of knowing
whether Thai is being indexed correctly. Here are some excerpts from the
Hadoop log:


2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.parse.HtmlParseFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.protocol.Protocol
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.QueryFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.net.URLFilter
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.analysis.NutchAnalyzer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.searcher.Summarizer
2549-12-15 11:25:55,638 DEBUG plugin.PluginRepository - Adding extension point org.apache.nutch.scoring.ScoringFilter
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Registered Plugins:
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Site Query Filter (query-site)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Basic Summarizer Plug-in (summary-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Text Parse Plug-in (parse-text)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         JavaScript Parser (parse-js)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Basic Query Filter (query-basic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         URL Query Filter (query-url)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Language Identification Parser/Filter (language-identifier)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository - Registered Extension-Points:
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Protocol (org.apache.nutch.protocol.Protocol)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch URL Filter (org.apache.nutch.net.URLFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Content Parser (org.apache.nutch.parse.Parser)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2549-12-15 11:25:55,638 INFO  plugin.PluginRepository -         Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2549-12-15 11:25:55,669 INFO  indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2549-12-15 11:25:55,779 INFO  lang.LanguageIdentifier - Language identifier configuration [1-4/2048]
2549-12-15 11:25:56,544 INFO  lang.LanguageIdentifier - Language identifier plugin supports: it(1000) is(1000) hu(1000) th(1000) sv(1000) fr(1000) ru(1000) fi(1000) es(1000) en(1000) el(1000) ee(1000) pt(1000) de(1000) da(1000) pl(1000) no(1000) nl(1000)
2549-12-15 11:25:56,544 INFO  indexer.IndexingFilters - Adding org.apache.nutch.analysis.lang.LanguageIndexingFilter
2549-12-15 11:25:57,091 INFO  indexer.Indexer - Optimizing index.
2549-12-15 11:25:57,544 INFO  indexer.Indexer - Indexer: done

///////////////////////////////////////////////////////////////////////////////////////
And this is the crawl log:

Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112523
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with:
java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112523
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: crawlxx3/segments/25491215112536
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawlxx3/segments/25491215112536
Fetcher: threads: 10
fetching http://www.pantip.com/cafe
redirectCount=0
fetch of http://www.pantip.com/cafe failed with:
java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawlxx3/crawldb
CrawlDb update: segment: crawlxx3/segments/25491215112536
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawlxx3/linkdb
LinkDb: adding segment: crawlxx3/segments/25491215112523
LinkDb: adding segment: crawlxx3/segments/25491215112536
LinkDb: done
Indexer: starting
Indexer: linkdb: crawlxx3/linkdb
Indexer: adding segment: crawlxx3/segments/25491215112523
Indexer: adding segment: crawlxx3/segments/25491215112536
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawlxx3/indexes
Dedup: done
Adding crawlxx3/indexes/part-00000









Shtykh Roman wrote:
> 
> Hi,
> 
> I have recently dealt with Japanese support and wrote up how I did it at
> http://nislab.human.waseda.ac.jp/blog/?page_id=7 . I think it'll give you
> some idea.
> 
> Br,
> Roman
> 
> --- sanjeev <[EMAIL PROTECTED]> wrote:
> 
>> 
>> Hi all,
>> 
>> I am still waiting for some help with the Thai language indexing and
>> searching.
>> 
>> Please help, as I'm quite lost on this one.
>> 
>> Thanks and regards,
>> sanjeev.
>> 
>> 
>> sanjeev wrote:
>> > 
>> > Thanks for clearing up some doubts. But exactly how do I wrap it?
>> > Do I need to make changes in the code to use the new Thai tokenizer?
>> > If yes, where are the places that need modification?
>> > Do I need to download a dev version and recompile?
>> > 
>> > Please, if you could possibly tell me the steps in brief, I would be
>> > highly obliged.
>> > 
>> > Thanks,
>> > sanjeev.
>> > 
>> > 
>> > 
>> > 
>> > Jérôme Charron wrote:
>> >> 
>> >>> I used an existing ThaiAnalyzer which was in the Lucene package.
>> >>> OK - I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled,
>> >>> and placed all class files in a jar, analysis-th.jar (do I need to
>> >>> bundle the ngp file in the jar as well?)
>> >> 
>> >> 1. You don't have to refactor the Lucene analyzer. Just wrap it like I
>> >> do with the French and German analyzers (they both use some analyzers
>> >> from Lucene).
>> >> 2. The analyzer doesn't need ngp files... I think you misunderstood
>> >> something:
>> >> 2.1 On one side there is the language identifier, which uses NGP files
>> >> to identify the language of a document.
>> >> 2.2 On the other side, if a suitable analyzer is found for the
>> >> identified language, it is used to analyze the document.
>> >> 
>> >> Regards
>> >> 
>> >> Jérôme
>> >> 
>> >> 
>> >> -- 
>> >> http://motrech.free.fr/
>> >> http://www.frutch.org/
>> >> 
>> >> 
>> > 
>> > 
>> 
>> -- 
>> View this message in context:
>> http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7827701
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7886152
Sent from the Nutch - Dev mailing list archive at Nabble.com.


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
