I'm sorry, the Thai analyzer is in the Lucene package, not a Nutch one, like so:

org.apache.lucene.analysis.th.ThaiAnalyzer

So I'm sure the crawl didn't pick up the language identifier.

What should I do now? Rename the package to nutch?

Can someone please help me?


Thanks and much appreciated again.

cheers,
sanjeev.


sanjeev wrote:
> 
> Thank you Mr. Teruhiko Kurosaka,
> 
> 
> I was able to locate the th.ngp file in nutch-0.8-dev distrib.
> 
> I was able to compile the distribution. When I ran the crawl, I'm not sure it
> picked up the language identifier. I added
> 
>  <implementation id="org.apache.nutch.analysis.th.ThaiAnalyzer"
> class="org.apache.nutch.analysis.th.ThaiAnalyzer" lang="th"/> 
> 
> to languageidentifier/plugin.xml
> 
> Then I ran a crawl and got a stupid error in the dedup step:
> 
> Dedup: adding indexes in: crawlnewpantip14nov2/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:393)
>       at
> org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:432)
>       at org.apache.nutch.crawl.Crawl.main(Crawl.java:131)
> 
> Your help is much appreciated.
> 
> 
> 
> Teruhiko Kurosaka wrote:
>> 
>> Oh, Thai words are not space delimited?
>> OK, in that case, you'd need to study how ThaiAnalyzer works and
>> then modify the rules in NutchAnalysis.jj (if you are going to use
>> the web search GUI from Nutch).  This is because the search
>> expressions are parsed by the parser generated from NutchAnalysis.jj
>> first before each term is handed to the language specific analyzer,
>> and currently if a character belongs to the CJK category, each character
>> is treated as though it were a word.  If ThaiAnalyzer does not do the same,
>> you can index the Thai docs but you won't be able to find any doc unless
>> the search term is one Unicode character.
>> 
>> 
>> -kuro
>> 
>>> -----Original Message-----
>>> From: sanjeev [mailto:[EMAIL PROTECTED] 
>>> Sent: 2006-11-08 19:28
>>> To: nutch-dev@lucene.apache.org
>>> Subject: Re: implement thai lanaguage analyzer in nutch
>>> 
>>> 
>>> I need a Thai analyzer for Nutch. I want the crawler to be intelligent enough
>>> to split Thai words correctly, since Thai doesn't have spaces between words.
>>> :-(
>>> 
>>> 
>>> 
>>> 
>>> ogjunk-nutch wrote:
>>> > 
>>> > Regarding Thai, there is a Thai Analyzer in Lucene already:
>>> > 
>>> > $ ll contrib/analyzers/src/java/org/apache/lucene/analysis/th/
>>> > total 24
>>> > drwxrwxr-x  7 otis otis 4096 Oct 27 02:08 .svn/
>>> > -rw-rw-r--  1 otis otis 1528 Jun  5 14:27 ThaiAnalyzer.java
>>> > -rw-rw-r--  1 otis otis 2437 Jun  5 14:27 ThaiWordFilter.java
>>> > 
>>> > Otis
>>> > 
>>> > ----- Original Message ----
>>> > From: Teruhiko Kurosaka <[EMAIL PROTECTED]>
>>> > To: sanjeev <[EMAIL PROTECTED]>; nutch-dev@lucene.apache.org
>>> > Sent: Wednesday, November 8, 2006 2:16:38 PM
>>> > Subject: RE: implement thai lanaguage analyzer in nutch
>>> > 
>>> > Sanjeev,
>>> > I don't think you should follow the Chinese example and extend the CJK
>>> > range.  This was needed because Chinese and Japanese don't use spaces to
>>> > separate words.  I believe Thai uses spaces, right? If so, you should
>>> > extend the LETTER range to include Thai characters rather than CJK.
>>> > 
>>> > Another place you would need to change is the LanguageIdentifier.
>>> > You would either train it or implement some hack in order for it to be
>>> > able to detect Thai-language documents that are not HTML with a lang="th"
>>> > attribute.
>>> > 
>>> > -kuro
>>> > 
>>> > 
>>> > 
>>> > 
>>> > 
>>> 
>>> -- 
>>> View this message in context: 
>>> http://www.nabble.com/implement-thai-lanaguage-analyzer-in-nutch-tf2587282.html#a7251826
>>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>>> 
>>> 
>> 
>> 
> 
> 
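For anyone following the thread, here is a minimal sketch of the difference Kuro describes between treating every character as its own word (what the thread says happens for CJK characters) and splitting on word boundaries. It uses only the JDK; as far as I recall Lucene's ThaiWordFilter also relies on java.text.BreakIterator, but treat that as an assumption. The class and method names (ThaiTokenSketch, isThai, perCharacter, byWordBreaks) are made up for illustration and are not part of Nutch or Lucene. How well Thai text is segmented depends on the Thai break data shipped with your JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Nutch or Lucene.
public class ThaiTokenSketch {

    /** True if the code point lies in the Unicode Thai block (U+0E00..U+0E7F). */
    static boolean isThai(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.THAI;
    }

    /** One token per code point, i.e. every character is treated as a word. */
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints().forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    /** Tokens taken between the JDK's word boundaries for the Thai locale. */
    static List<String> byWordBreaks(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (!piece.isEmpty()) {
                tokens.add(piece); // skip whitespace-only segments
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "Thai language" written in Thai, as Unicode escapes to avoid encoding issues.
        String text = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        System.out.println("per character: " + perCharacter(text));
        System.out.println("word breaks:   " + byWordBreaks(text));
    }
}
```

If the index is built from word-level tokens while the query parser emits one token per character (or the other way round), the terms will not match, which is the mismatch Kuro warns about.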
