On Tue, 9 Aug 2005 16:11:39 +0430, mohsen ali momeni
<[EMAIL PROTECTED]> posted to gmane.comp.misc.persiancomputing:
 > How can I auto-detect language of a webpage without knowing it's
 > charset? (suppose language and charset is not defined in header)
 > Is there a simple (not time-consuming) method to detect a page charset?

Sorry for the late reply -- I just stumbled over your question via Google.
I hope I can help a little.

There are tools which try to identify both the language and the
character set of a piece of text at the same time, which sidesteps
the "chicken and egg" problem you run into if you tackle the two
separately. A popular free tool is TextCat, which ships with language
(and, incidentally, charset) models for almost 50 languages, although
not for Farsi.
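
To make the idea concrete, here is a rough Python sketch of the n-gram
ranking approach (Cavnar and Trenkle) that TextCat is based on. It is
not TextCat's code: it works on raw bytes rather than tokenized words,
the function names and the fixed penalty of 400 are my own choices, and
the one-line training samples are only placeholders for a real corpus.
Since each model is built from bytes in one specific encoding, picking
the closest model answers the language and the charset question in one
step.

```python
from collections import Counter

def ngram_profile(data: bytes, max_n: int = 5, top: int = 400) -> list[bytes]:
    """Return the most frequent byte n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Descending count; ties broken by the n-gram bytes so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc: list[bytes], model: list[bytes], penalty: int = 400) -> int:
    """Sum of rank differences; n-grams absent from the model cost a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(model)}
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc))

def guess(data: bytes, models: dict[str, list[bytes]]) -> str:
    """Return the label of the model whose profile is closest to the input."""
    doc = ngram_profile(data)
    return min(models, key=lambda label: out_of_place(doc, models[label]))

# Placeholder training samples (assumption: a real model needs far more text).
english = "the quick brown fox jumps over the lazy dog and the cat sat on the mat"
german = "die Größe der Straße über den Bäumen für schöne Mädchen"

# One model per (language, charset) pair, built from the encoded bytes.
models = {
    "english-ascii": ngram_profile(english.encode("ascii")),
    "german-latin-1": ngram_profile(german.encode("latin-1")),
    "german-utf-8": ngram_profile(german.encode("utf-8")),
}

print(guess(b"the dog and the cat", models))
print(guess("schöne Straße".encode("latin-1"), models))
```

The labels are just dictionary keys, so a Farsi model in two encodings
would simply be two more entries.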

However, the Arabic model which is included in the distribution seems
to have been trained, at least in part, on Farsi data. It also seems
to report the wrong charset, presumably for the same reason (garbage
in, garbage out).

I know neither Farsi nor Arabic, so I would appreciate it if somebody
could confirm this information. (I'm hoping this message will go to
the list too, but I'm not subscribed, so it might get rejected.)

In any event, the language models which are included in the TextCat
distribution are "proof of concept" ones, not really production
quality. Another tool which comes with better models is mguesser. And
in fact, the mguesser models can be used more or less directly by
TextCat (and vice versa).

If you have a sizable corpus of good-quality Persian / Farsi text
(sorry, I'm vague about the possible difference) to train the tool
with, it would be most helpful if you could make it (the raw data, or
your training results) available.
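
For what it's worth, a single Unicode corpus should be enough to train
one model per charset, simply by re-encoding it. Below is a minimal
Python sketch, assuming byte n-gram profiles in the spirit of these
tools. The name train_models and the label format are made up for
illustration, the output is not in TextCat's or mguesser's model file
format, and the sample string is a stand-in for a real corpus.

```python
from collections import Counter

def byte_ngram_counts(data: bytes, max_n: int = 5) -> Counter:
    """Count all byte n-grams of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts

def train_models(text: str, language: str, charsets: list[str],
                 top: int = 400) -> dict[str, list[bytes]]:
    """Build one ranked n-gram profile per charset that can represent the text."""
    models = {}
    for charset in charsets:
        try:
            data = text.encode(charset)
        except UnicodeEncodeError:
            continue  # this charset cannot represent the corpus; skip it
        ranked = sorted(byte_ngram_counts(data).items(),
                        key=lambda kv: (-kv[1], kv[0]))
        models[f"{language}-{charset}"] = [gram for gram, _ in ranked[:top]]
    return models

# Stand-in corpus: a few Persian words ("father and mother"), written with
# escapes so the source stays ASCII. A real corpus would be much larger.
corpus = "\u067e\u062f\u0631 \u0648 \u0645\u0627\u062f\u0631"

models = train_models(corpus, "persian", ["utf-8", "cp1256", "iso8859_6"])
print(sorted(models))
```

ISO 8859-6 has no code point for the Persian letter peh, so that
charset is skipped here rather than producing a corrupt model.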

Links:
  <http://odur.let.rug.nl/~vannoord/TextCat/>
     Perl source + (small) models + on-line demo + good documentation

  <http://www.mnogosearch.org/guesser/>
     C source + models; site is somewhat poor quality but the software is OK

The TextCat site also has a page with a survey of the competition
(although mguesser is mysteriously missing from the list).

Hope this helps,

/* era */

-- 
If this were a real .signature, it would suck less.  Well, maybe not.

