Thanks Jerome,

I used the existing ThaiAnalyzer from the Lucene package.

OK, I renamed lucene.analysis.th.* to nutch.analysis.th.*, compiled it, and
placed all the class files in a jar, analysis-th.jar. (Do I need to bundle
the NGP file in the jar as well?)

Take a look at the log below from a sample crawl. The language-identifier
plugin still does not seem to be activated; the log lists it under
"not including".

I urgently need your help resolving this issue.

Cheers, regards, and thanks for all your help.

sanjeev.
491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-default.xml
491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/crawl-tool.xml
491116 151804 parsing
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/nutch-site.xml
491116 151804 No FS indicated, using default:local
491116 151804 crawl started in: crawlnewxx2
491116 151804 rootUrlFile = urls
491116 151804 threads = 10
491116 151804 depth = 10
491116 151804 Created webdb at
LocalFS,C:\cygwin\home\robert\nutch-0.7.2\crawlnewxx2\db
491116 151804 Starting URL processing
491116 151804 Plugins: looking in: C:\cygwin\home\robert\nutch-0.7.2\plugins
491116 151804 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\clustering-carrot2
491116 151804 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\creativecommons
491116 151804 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\index-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.indexer.IndexingFilter
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\index-more
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\ontology
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-ext
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.html.HtmlParser
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-js
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-msword
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-pdf
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-rss
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-text\plugin.xml
491116 151805 impl: point=org.apache.nutch.parse.Parser
class=org.apache.nutch.parse.text.TextParser
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-file
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-ftp
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-http\plugin.xml
491116 151805 impl: point=org.apache.nutch.protocol.Protocol
class=org.apache.nutch.protocol.http.Http
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\protocol-httpclient
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-basic\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.basic.BasicQueryFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-more
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-site\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.site.SiteQueryFilter
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\query-url\plugin.xml
491116 151805 impl: point=org.apache.nutch.searcher.QueryFilter
class=org.apache.nutch.searcher.url.URLQueryFilter
491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-prefix
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
491116 151805 impl: point=org.apache.nutch.net.URLFilter
class=org.apache.nutch.net.RegexURLFilter
491116 151805 found resource regex-urlfilter.txt at
file:/C:/cygwin/home/robert/nutch-0.7.2/conf/regex-urlfilter.txt
491116 151805 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
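
The plugin lines in the log above can be checked mechanically. The following Python sketch is illustrative only and not part of Nutch: it lists the plugin directories the log reports as parsed or skipped, then tests a plugin id against a plugin.includes-style regular expression. The function names are hypothetical, and the DEFAULT_INCLUDES value is an assumption reconstructed from the plugins this log shows as loaded; check nutch-default.xml for the actual value. It also assumes the plugin id equals the directory name and that the whole id must match the expression.

```python
import re

# ASSUMPTION: reconstructed from the plugins the log reports as loaded;
# the authoritative value lives in conf/nutch-default.xml.
DEFAULT_INCLUDES = (
    "nutch-extensionpoints|protocol-http|urlfilter-regex|"
    "parse-(text|html)|index-basic|query-(basic|site|url)"
)

# Two entries copied from the log above (message and path on separate lines).
SAMPLE_LOG = r"""491116 151805 not including:
C:\cygwin\home\robert\nutch-0.7.2\plugins\language-identifier
491116 151805 parsing:
C:\cygwin\home\robert\nutch-0.7.2\plugins\parse-html\plugin.xml
"""


def _plugin_dir(path):
    """Return the plugin directory name from a Windows or Unix path."""
    parts = [p for p in re.split(r"[\\/]", path) if p]
    if parts[-1] == "plugin.xml":
        parts.pop()  # drop the file name, keep the directory
    return parts[-1]


def plugin_status(log_text):
    """Return (parsed, skipped) plugin directory names found in a crawl log."""
    # "parsing:" (with a colon) marks plugin descriptors; the config-file
    # lines in the log use "parsing" without a colon and are ignored.
    parsed = [_plugin_dir(p) for p in re.findall(r"parsing:\s+(\S+)", log_text)]
    skipped = [_plugin_dir(p)
               for p in re.findall(r"not including:\s+(\S+)", log_text)]
    return parsed, skipped


def would_include(plugin_id, includes):
    """True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


parsed, skipped = plugin_status(SAMPLE_LOG)
print("parsed: ", parsed)
print("skipped:", skipped)
print(would_include("language-identifier", DEFAULT_INCLUDES))
print(would_include("language-identifier",
                    DEFAULT_INCLUDES + "|language-identifier"))
```

If the expression that is actually in effect for the crawl does not match language-identifier, the plugin stays in the skipped list, which is what this log shows.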

Jérôme Charron wrote:
> 
>> OK. I was able to enable the language identifier plugin by adding it to
>> the plugin.includes property in nutch-site.xml, but I'm not sure that
>> alone will get Thai text recognized and tokenized properly.
>> What else do I have to do? Please help me.
> 
> 1. You must create a Thai NGP (N-gram profile) file so that the language
> identifier can identify Thai.
> 2. You must create a Thai analyzer (see, for instance, the analysis-fr and
> analysis-de sample analyzers).
> 
> Best Regards
> 
> Jérôme
> 
> 
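
To illustrate what an n-gram profile captures, here is a minimal Python sketch that ranks the character n-grams of a text by frequency. It is a generic illustration, not Nutch's implementation, and it does not produce Nutch's NGP file format. The function name, the "_" boundary marker, and the default n-gram lengths are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, min_n=1, max_n=4, top=10):
    """Rank character n-grams of `text` by frequency, most frequent first.

    ASSUMPTION: n-gram lengths 1..4 and "_" as a boundary marker are
    illustrative choices, not values taken from Nutch.
    """
    # Collapse whitespace runs into a single "_" and pad both ends, so
    # n-grams at the start and end of a run of text are distinguishable.
    normalized = "_" + "_".join(text.split()) + "_"
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts.most_common(top)


# Thai is written without spaces between words, so most n-grams in a
# Thai sample fall inside long unbroken runs of characters.
for gram, count in ngram_profile("ภาษาไทย ภาษาไทย", min_n=2, max_n=2, top=5):
    print(gram, count)
```

A language identifier can compare the ranked n-grams of an unknown page against stored per-language profiles; a profile built from a large, representative Thai corpus would be needed for reliable results.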

-- 
View this message in context: 
http://www.nabble.com/implement-thai-language-indexing-and-search-tf2641172.html#a7375925
Sent from the Nutch - Dev mailing list archive at Nabble.com.
