Re: farsi language auto-detection in web pages
On Wed, 10 Aug 2005, mohsen ali momeni wrote:

> Hi,
> Thanks for the reply.
> What I exactly need is CP1256 detection, and after that detecting whether the language is Persian or not.

As you can guess, all non-Unicode character sets share the same 8-bit space, so they overlap all the time. Your only bet at charset detection is to look at the areas that are left unencoded in each character set and cross out the charsets for which the page uses those forbidden areas. As for language detection, which can be used in charset detection too, you can look for the string SPACE REH ALEPH SPACE as a good indicator of Persian.

Regards,
--behdad
http://behdad.org/

___
PersianComputing mailing list
PersianComputing@lists.sharif.edu
http://lists.sharif.edu/mailman/listinfo/persiancomputing
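The elimination idea and the SPACE REH ALEPH SPACE check described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the candidate list, the function names `surviving_charsets` and `looks_persian_cp1256`, and the reliance on Python's built-in codec tables are choices made for this sketch, not something from the thread. Note also that Windows-1256 assigns a character to nearly every byte value, so elimination tends to rule out the other candidates rather than confirm CP1256 on its own.

```python
# Hypothetical candidate list; a real detector would consider many more charsets.
CANDIDATES = ["utf-8", "cp1256", "cp1252", "iso8859_6"]

# SPACE REH ALEF SPACE, i.e. the Persian word "ra" surrounded by spaces.
RA_MARKER = " \u0631\u0627 "


def surviving_charsets(data, candidates=CANDIDATES):
    """Return the candidates whose strict decoder accepts every byte of data.

    A charset is crossed out as soon as the data uses a byte (or byte
    sequence) that the charset leaves unencoded.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode raises on unencoded bytes
        except UnicodeDecodeError:
            continue
        survivors.append(name)
    return survivors


def looks_persian_cp1256(data):
    """Heuristic: does the raw page contain SPACE REH ALEF SPACE in CP1256?"""
    return RA_MARKER.encode("cp1256") in data


if __name__ == "__main__":
    page = "او کتاب را به پدر داد".encode("cp1256")
    print(surviving_charsets(page))
    print(looks_persian_cp1256(page))
```

A single marker string is a weak signal on short pages, so in practice it would probably be combined with other frequent Persian words or with the Persian-specific letters (PEH, TCHEH, JEH, GAF).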
Re: farsi language auto-detection in web pages
mohsen ali momeni wrote:

> How can I auto-detect the language of a webpage without knowing its charset? (Suppose the language and charset are not defined in the header.)

Kind of a lousy way to do things, but ...

> Is there a simple (not time-consuming) method to detect a page charset?

http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html
Re: farsi language auto-detection in web pages
On Tue, 9 Aug 2005, mohsen ali momeni wrote:

> Hi,
> How can I auto-detect the language of a webpage without knowing its charset? (Suppose the language and charset are not defined in the header.)
> Is there a simple (not time-consuming) method to detect a page charset?

If it's UTF-8 or UTF-16, kinda easy, not really otherwise.

Regards,
--behdad
http://behdad.org/
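To illustrate why the UTF-8 and UTF-16 cases are the easy ones, here is a minimal Python sketch. It assumes a byte-order mark (BOM) check followed by strict UTF-8 validation; the function name `sniff_unicode` is made up for this example, and UTF-16 without a BOM is deliberately left undetected.

```python
import codecs


def sniff_unicode(data):
    """Return a codec name if data looks like UTF-8 or UTF-16, else None."""
    # A BOM at the start of the page is the cheapest and strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # UTF-8 has a strict byte structure, so legacy 8-bit text with
    # non-ASCII characters almost never validates as UTF-8 by accident.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Pure ASCII also ends up here, which is harmless for decoding.
    return "utf-8"


if __name__ == "__main__":
    print(sniff_unicode("سلام".encode("utf-8")))
    print(sniff_unicode("سلام".encode("cp1256")))
```

When this returns None, the page is in some legacy charset and the harder guessing described in the thread is needed.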