RE: implement thai language analyzer in nutch

Teruhiko Kurosaka Fri, 10 Nov 2006 09:58:24 -0800

Oh, Thai words are not space-delimited?
OK, in that case, you'd need to study how ThaiAnalyzer works and
then modify the rules in NutchAnalysis.jj (if you are going to use
the web search GUI from Nutch).  This is because search expressions
are first parsed by the parser generated from NutchAnalysis.jj, and
only then is each term handed to the language-specific analyzer.
Currently, if a character belongs to the CJK category, that parser
treats each character as though it were a word.  If ThaiAnalyzer does
not tokenize the same way, you can index the Thai docs, but you won't
be able to find any doc unless the search term is a single Unicode
character.
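
To make the mismatch concrete, here is a minimal, self-contained Java
sketch.  It does not use any Nutch or Lucene API.  The class and method
names are made up for illustration, the index terms are hard-coded on the
assumption that the index-time analyzer emits whole words (ภาษา and ไทย
for the text ภาษาไทย), and the query side is assumed to split Thai
character by character, as the CJK rule would if it were extended to Thai.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of Nutch or Lucene.
public class ThaiTokenMismatchDemo {

    // Query side (assumption): emit every Thai character as its own term,
    // mimicking per-character handling of CJK in the query parser.
    public static List<String> perCharacterTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.THAI) {
                terms.add(new String(Character.toChars(cp)));
            }
            i += Character.charCount(cp);
        }
        return terms;
    }

    // A query matches only if every one of its terms exists in the index.
    public static boolean allTermsIndexed(List<String> queryTerms,
                                          Set<String> indexTerms) {
        return !queryTerms.isEmpty() && indexTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index side (assumption): word-level terms for "Thai language".
        Set<String> indexTerms = new HashSet<>(Arrays.asList(
                "\u0E20\u0E32\u0E29\u0E32",   // "language"
                "\u0E44\u0E17\u0E22"));       // "Thai"

        String query = "\u0E44\u0E17\u0E22"; // the user searches for "Thai"

        // Split per character: three one-character terms, none indexed.
        System.out.println("per-character match: "
                + allTermsIndexed(perCharacterTerms(query), indexTerms));

        // Kept as one word: the term is found in the index.
        System.out.println("whole-word match: "
                + allTermsIndexed(Arrays.asList(query), indexTerms));
    }
}
```

Under those assumptions the per-character query finds nothing and the
whole-word query matches, which is why the query grammar and the
index-time analyzer need to agree on how Thai text is split.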



-kuro

> -----Original Message-----
> From: sanjeev [mailto:[EMAIL PROTECTED] 
> Sent: 2006-11-08 19:28
> To: nutch-dev@lucene.apache.org
> Subject: Re: implement thai lanaguage analyzer in nutch
> 
> 
> I need a Thai Analyzer for Nutch. I want the crawler to be intelligent
> enough to split Thai words correctly, since Thai doesn't have spaces
> between words.
> :-(
> 
> 
> 
> 
> ogjunk-nutch wrote:
> > 
> > Regarding Thai, there is a Thai Analyzer in Lucene already:
> > 
> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
> > total 24
> > drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
> > -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
> > -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
> > 
> > Otis
> > 
> > ----- Original Message ----
> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
> > Sent: Wednesday, November 8, 2006 2:16:38 PM
> > Subject: RE: implement thai lanaguage analyzer in nutch
> > 
> > Sanjay,
> > I don't think you should follow the Chinese example and extend the
> > CJK range.  That was needed because Chinese and Japanese don't use
> > spaces to separate words.  I believe Thai uses spaces, right? If so,
> > you should extend the LETTER range to include Thai characters rather
> > than the CJK range.
> > 
> > Another place you would need to change is the LanguageIdentifier.
> > You would either train it or implement some hack so that it can
> > detect Thai-language documents that are not HTML with a lang="th"
> > attribute.
> > 
> > -kuro
> > 
> > 
> > 
> > 
> > 
> 
