Re: [Nutch-dev] implement thai lanaguage analyzer in nutch

sanjeev Tue, 14 Nov 2006 08:48:41 -0800

Thank you Mr. Teruhiko Kurosaka,


I was able to locate the th.ngp file in the nutch-0.8-dev distribution.

I was able to compile the distribution. When I ran the crawl, though, I'm not
sure it picked up the language identifier. I added

 <implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer"
class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/> 

to languageidentifier/plugin.xml

Then I ran a crawl and it failed with an error at the dedup step:

Dedup: adding indexes in: crawlnewpantip14nov2/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)

Your help is much appreciated.



Teruhiko Kurosaka wrote:
> 
> Oh, Thai words are not space delimited?
> OK, in that case, you'd need to study how ThaiAnalyzer works and
> then modify the rules in NutchAnalysis.jj (if you are going to use
> the web search GUI from Nutch).  This is because the search
> expressions are parsed by the parser generated from NutchAnalysis.jj
> first before each term is handed to the language specific analyzer,
> and currently if a character belongs to the CJK category, each character
> is treated as though it were a word.  If ThaiAnalyzer does not do the
> same, you can index the Thai docs but you won't be able to find any doc
> unless the search term is one Unicode character.
> 
> 
> -kuro
> 
>> -----Original Message-----
>> From: sanjeev [mailto:[EMAIL PROTECTED] 
>> Sent: 2006-11-08 19:28
>> To: nutch-dev@lucene.apache.org
>> Subject: Re: implement thai lanaguage analyzer in nutch
>> 
>> 
>> I need a Thai Analyzer for Nutch. I want the crawler to be intelligent
>> enough to split Thai words correctly, since Thai doesn't have spaces
>> between words.
>> :-(
>> 
>> 
>> 
>> 
>> ogjunk-nutch wrote:
>> > 
>> > Regarding Thai, there is a Thai Analyzer in Lucene already:
>> > 
>> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>> > total 24
>> > drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
>> > -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
>> > -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
>> > 
>> > Otis
>> > 
>> > ----- Original Message ----
>> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
>> > Sent: Wednesday, November 8, 2006 2:16:38 PM
>> > Subject: RE: implement thai lanaguage analyzer in nutch
>> > 
>> > Sanjay,
>> > I don't think you should follow the Chinese example and extend the CJK
>> > range. This was needed because Chinese and Japanese don't use spaces to
>> > separate words.  I believe Thai uses spaces, right? If so, you should
>> > extend the LETTER range to include Thai characters rather than CJK.
>> > 
>> > Another place you would need to change is the LanguageIdentifier.
>> > You would either train it, or implement some hack, in order for it to
>> > be able to detect Thai-language documents that are not HTML with a
>> > lang="th" attribute.
>> > 
>> > -kuro
>> > 
>> > 
>> > 
>> > 
>> > 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>> 
>> 
> 
> 
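
A rough illustration of the tokenization mismatch kuro describes in the
quoted reply above. This is only a sketch and does not use Nutch or Lucene
classes: the class name ThaiSegmentationSketch, both method names, and the
sample string are made up for the example. It contrasts one-token-per-character
splitting (the way the query parser is said to treat CJK characters) with
word splitting via the JDK's java.text.BreakIterator for the "th" locale,
which I believe is the same JDK facility Lucene's ThaiWordFilter relies on.
Whether the JDK actually segments Thai into dictionary words depends on the
locale data shipped with the runtime, so no output is claimed here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiSegmentationSketch {

    // One token per Unicode code point, mimicking the per-character
    // treatment described for CJK text at query-parse time.
    public static List<String> splitPerCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    // Tokens at the word boundaries reported by the JDK BreakIterator
    // for the Thai locale. The boundaries depend on the runtime's data.
    public static List<String> splitWithBreakIterator(String text) {
        BreakIterator words =
            BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
             start = end, end = words.next()) {
            tokens.add(text.substring(start, end));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical sample query: "Thai language", written with
        // Unicode escapes so the source file stays ASCII-only.
        String query = "\u0e20\u0e32\u0e29\u0e32\u0e44\u0e17\u0e22";
        System.out.println("per character: " + splitPerCharacter(query));
        System.out.println("break iterator: " + splitWithBreakIterator(query));
    }
}
```

If the two lists differ, terms produced one way at index time will not match
terms produced the other way at query time, which is the failure mode kuro
warns about.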

-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7334391
Sent from the Nutch - Dev mailing list archive at Nabble.com.

