Sanjay,
I don't think you should follow the Chinese example and extend the CJK
range. That was needed because Chinese and Japanese don't use spaces to
separate words. I believe Thai uses spaces, right? If so, you should
extend the LETTER range to include Thai characters rather than CJK.
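
To illustrate what I mean by extending LETTER, here is a rough Java
sketch, not the actual Nutch grammar. The class and method names are
made up for illustration, and I am assuming the Thai Unicode block
(U+0E00 to U+0E7F) is the range you would add. A run of Thai characters
then comes out as one token, split only on spaces or other non-letters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the tokenizer's LETTER rule; not Nutch code.
final class ThaiLetterRange {
    // Assumed range: the Thai Unicode block, U+0E00 through U+0E7F.
    static final char THAI_START = '\u0E00';
    static final char THAI_END = '\u0E7F';

    static boolean isThai(char c) {
        return c >= THAI_START && c <= THAI_END;
    }

    // Simplified LETTER test: ASCII letters plus the Thai block.
    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || isThai(c);
    }

    // Splits text into runs of LETTER characters, so Thai text is
    // broken only where a non-letter (such as a space) appears.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

By contrast, the CJK approach would emit each character as its own
token, which is only appropriate when there are no word separators.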

Another place you would need to change is the LanguageIdentifier.
You would either have to train it or implement some hack so that it can
detect Thai-language documents that are not HTML with a lang="th"
attribute.
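
As an example of the kind of hack I have in mind (a sketch only, not
part of LanguageIdentifier; the class name, method names, and the 0.5
threshold are my own assumptions), you could guess that a document is
Thai when most of its letter-like characters fall in the Thai Unicode
block:

```java
// Hypothetical heuristic; not part of Nutch's LanguageIdentifier.
final class ThaiDetector {
    // Assumed cutoff: call the text Thai if at least half of the
    // counted characters are in the Thai block.
    static final double DEFAULT_THRESHOLD = 0.5;

    // Fraction of counted characters (letters, plus anything in the
    // Thai block such as vowel and tone marks) that are Thai.
    static double thaiRatio(String text) {
        int counted = 0;
        int thai = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean isThai =
                Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.THAI;
            if (isThai) {
                thai++;
                counted++;
            } else if (Character.isLetter(cp)) {
                counted++;
            }
        }
        return counted == 0 ? 0.0 : (double) thai / counted;
    }

    static boolean looksThai(String text) {
        return thaiRatio(text) >= DEFAULT_THRESHOLD;
    }
}
```

This only separates Thai from languages written in other scripts, so
training the identifier properly is still the better long-term fix.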

-kuro

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
