On Tue, 9 Aug 2005 16:11:39 +0430, mohsen ali momeni <[EMAIL PROTECTED]> posted to gmane.comp.misc.persiancomputing:
> How can I auto-detect language of a webpage without knowing it's
> charset? (suppose language and charset is not defined in header)
> Is there a simple (not time-consuming) method to detect a page charset?

Sorry for the late reply -- I just stumbled over your question via Google. I'm hoping I might be able to help a little.

There are tools which attempt to identify both the language and the character set of a piece of text at the same time, which neatly sidesteps the "chicken and egg" problem you run into if you try to tackle the two problems separately.

A popular free tool is TextCat, which includes language (and, incidentally, charset) identification for almost 50 languages, although not for Farsi. ...

However, it would seem that the Arabic language model which is included in the distribution has in fact been trained with Farsi data to some extent. It also seems to be getting the charset wrong (as per the garbage in / garbage out principle). I know neither Farsi nor Arabic, so I would appreciate it if somebody could confirm this. (I'm hoping this message will go to the list too, but I'm not subscribed, so it might get rejected.)

In any event, the language models included in the TextCat distribution are "proof of concept" ones, not really production quality. Another tool which comes with better models is mguesser. In fact, the mguesser models can be used more or less directly by TextCat (and vice versa).

If you have a sizable corpus of good-quality Persian / Farsi text (sorry, I'm vague about the possible difference) to train the tool with, it would be most helpful if you could make it (the raw data, or your training results) available.

Links:

<http://odur.let.rug.nl/~vannoord/TextCat/>
  Perl source + (small) models + on-line demo + good documentation

<http://www.mnogosearch.org/guesser/>
  C source + models; site is somewhat poor quality but the software is OK

The TextCat site also has a page with a survey of the competition (although mguesser is mysteriously missing from the list).

Hope this helps,

/* era */

-- 
If this were a real .signature, it would suck less.  Well, maybe not.
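
PS. In case it helps to see why treating language and charset as a single problem works, here is a rough Python sketch of the n-gram rank-order idea (Cavnar and Trenkle) which, as far as I understand it, TextCat is based on. You train one profile per (language, charset) pair on the raw bytes, so the charset falls out of the classification for free. This is not TextCat's or mguesser's actual code: the function names, the 300 n-gram cutoff and the toy training sentences are all my own assumptions, and real models need far more training text.

```python
from collections import Counter


def ngram_profile(data: bytes, max_n: int = 3, top: int = 300) -> list[bytes]:
    """Rank the byte n-grams (lengths 1..max_n) of data, most frequent first."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Break frequency ties by the n-gram bytes so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc: list[bytes], model: list[bytes]) -> int:
    """Sum of rank differences; n-grams unknown to the model get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(model)}
    penalty = len(model)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc))


def train(samples: dict[tuple[str, str], str]) -> dict[tuple[str, str], list[bytes]]:
    """Build one profile per (language, charset) pair from raw training text."""
    return {(lang, charset): ngram_profile(text.encode(charset))
            for (lang, charset), text in samples.items()}


def guess(data: bytes, models: dict[tuple[str, str], list[bytes]]) -> tuple[str, str]:
    """Return the (language, charset) pair whose profile is closest to data."""
    doc = ngram_profile(data)
    return min(models, key=lambda pair: out_of_place(doc, models[pair]))


# Toy training data (hypothetical); real models need far more text than this.
SAMPLES = {
    ("english", "ascii"): "Where is the coffee shop? It is next to the railway station.",
    ("french", "utf-8"): "Où est le café ? Il est à côté de la gare, près de l'hôtel.",
    ("french", "iso-8859-1"): "Où est le café ? Il est à côté de la gare, près de l'hôtel.",
}
MODELS = train(SAMPLES)

# Bytes of "unknown" origin, as they might arrive from a page without headers.
page = "Le café est près de la gare.".encode("iso-8859-1")
print(guess(page, MODELS))
```

Because the profiles are built from bytes rather than decoded characters, the same French text yields two different models for UTF-8 and ISO-8859-1, and whichever one is closest to the page tells you the charset as well as the language. With training data this small the guesses are not reliable; the point is only to show the mechanism.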