Re: farsi language auto-detection in web pages

2005-08-10 Thread Behdad Esfahbod
On Wed, 10 Aug 2005, mohsen ali momeni wrote:

 Hi,
 Thanks for the reply,
 What I exactly need is CP1256 detection, and after that detecting
 whether the language is Persian or not.

As you can guess, all non-Unicode character sets share the same
8-bit space, so they overlap all the time.  Your only bet at
charset detection is to look at the areas that are left unencoded
in each character set and cross out any charset whose forbidden
areas the page uses.  Language detection can feed back into
charset detection too: you can look for the string SPACE REH
ALEPH SPACE as a good indicator of Persian.
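
A rough sketch of that elimination idea in Python, for illustration only.
The candidate list and the function names are my own assumptions, not
anything from a real detector.  It relies on Python's strict decoders
raising an error on bytes a charset leaves unassigned.  Note that, as far
as I know, Python's cp1256 table assigns all 256 byte values, so CP1256
itself never gets crossed out this way; elimination only narrows down the
other candidates, and the SPACE REH ALEPH SPACE check does the rest.

```python
# Hypothetical candidate list; extend it with whatever charsets you expect.
CANDIDATES = ["cp1256", "iso8859_6", "cp1252"]

# SPACE REH ALEPH SPACE, written with escapes to avoid bidi confusion.
PERSIAN_MARKER = " \u0631\u0627 "


def surviving_charsets(data: bytes, candidates=CANDIDATES):
    """Cross out every charset in which `data` hits an unassigned byte."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: unassigned bytes raise
        except UnicodeDecodeError:
            continue  # the page uses a forbidden area of this charset
        survivors.append(name)
    return survivors


def looks_persian(data: bytes, charset: str) -> bool:
    """Check for the marker string after decoding with the guessed charset."""
    text = data.decode(charset, errors="replace")
    return PERSIAN_MARKER in text
```

A single hit on the marker is weak evidence on a short page; counting
occurrences relative to page length would be a more cautious test.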


 Regards,

--behdad
http://behdad.org/
___
PersianComputing mailing list
PersianComputing@lists.sharif.edu
http://lists.sharif.edu/mailman/listinfo/persiancomputing


Re: farsi language auto-detection in web pages

2005-08-09 Thread Paul Hastings

mohsen ali momeni wrote:

How can I auto-detect the language of a webpage without knowing its
charset? (suppose language and charset are not defined in the header)


kind of a lousy way to do things but ...


Is there a simple (not time-consuming) method to detect a page charset?


http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html
http://icu.sourceforge.net/apiref/icu4j/com/ibm/icu/text/CharsetMatch.html


Re: farsi language auto-detection in web pages

2005-08-09 Thread Behdad Esfahbod
On Tue, 9 Aug 2005, mohsen ali momeni wrote:

 Hi,

 How can I auto-detect the language of a webpage without knowing its
 charset? (suppose language and charset are not defined in the header)
 Is there a simple (not time-consuming) method to detect a page charset?

If it's UTF-8 or UTF-16, kinda easy, not really otherwise.
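
To show why the Unicode case is the easy one, here is a minimal Python
sketch.  The function name is hypothetical and the logic is only one
plausible approach: look for a byte order mark, then fall back to a
strict UTF-8 decode, which legacy 8-bit Arabic text almost never
survives.  It does not handle UTF-16 without a BOM, and a UTF-32 LE BOM
would be mistaken for UTF-16.

```python
import codecs


def sniff_unicode(data: bytes):
    """Return a Unicode encoding name for `data`, or None if it is not one."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")  # strict: malformed sequences raise
    except UnicodeDecodeError:
        return None  # some legacy 8-bit charset; fall back to elimination
    return "utf-8"  # pure ASCII also ends up here
```

Anything that comes back as None is where the hard part starts.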


 Regards,

--behdad
http://behdad.org/