Re: implement Thai language analyzer in Nutch

sanjeev Wed, 08 Nov 2006 21:48:59 -0800

OK. I downloaded the Lucene in Action code examples from the book and found
some analyzers and tests/demos, including ones for Chinese.


But these analyzers were standalone Java programs with a main method.

My question is how to integrate an analyzer into Nutch so that the index
created by the crawl process is searchable in Thai.

Someone please help as I'm hopelessly confused by the whole thing. :-(

cheers,
sanjeev.





ogjunk-nutch wrote:
> 
> Regarding Thai, there is a Thai Analyzer in Lucene already:
> 
> $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> total 24
> drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
> -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
> -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
> 
> Otis
> 
> ----- Original Message ----
> From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
> Sent: Wednesday, November 8, 2006 2:16:38 PM
> Subject: RE: implement Thai language analyzer in Nutch
> 
> Sanjay,
> I don't think you should follow the Chinese example and extend the CJK
> range. That was needed because Chinese and Japanese don't use spaces to
> separate words. I believe Thai uses spaces, right? If so, you should
> extend the LETTER range to include Thai characters rather than CJK.
> 
> Another place you would need to change is the LanguageIdentifier.
> You would either train it or implement some hack so that it can detect
> Thai-language documents that are not HTML with a lang="th" attribute.
> 
> -kuro
> 
> 
> 
> 
> 
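
As an illustration of the "some hack" idea quoted above, here is a minimal
sketch in plain Java that guesses whether a piece of text is Thai by counting
how many of its letters fall in the Unicode Thai block (U+0E00 to U+0E7F).
The names ThaiDetector, thaiLetterRatio and looksThai, and the 0.5 threshold,
are made up for this example. They are not part of Nutch's LanguageIdentifier,
and wiring the check into the plugin is not shown.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not a Nutch or Lucene class.
final class ThaiDetector {
    // Assumed cut-off: call the text Thai when more than half of its letters are Thai.
    static final double THRESHOLD = 0.5;

    // Fraction of letter code points that belong to the Unicode Thai block.
    // Combining vowel and tone marks are not letters for Character.isLetter,
    // so they are skipped in both counts.
    static double thaiLetterRatio(String text) {
        int letters = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // digits, spaces, punctuation, combining marks
            }
            letters++;
            if (UnicodeBlock.of(cp) == UnicodeBlock.THAI) {
                thai++;
            }
        }
        return letters == 0 ? 0.0 : (double) thai / letters;
    }

    static boolean looksThai(String text) {
        return thaiLetterRatio(text) > THRESHOLD;
    }

    public static void main(String[] args) {
        String thaiSample = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35"; // a Thai greeting, written with escapes
        System.out.println(looksThai(thaiSample));
        System.out.println(looksThai("hello world"));
    }
}
```

A check like this only tells Thai script apart from other scripts. It says
nothing about where the word boundaries are, so it would complement an
analyzer rather than replace one.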

-- 
View this message in context: 
http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7252838
Sent from the Nutch - Dev mailing list archive at Nabble.com.
