On 2008-04-22 18:33, Bill Janssen wrote:
The 2002 paper "A language and character set determination method
based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami, Ario
Ohsato, and Yoshihide Chubachi seems to me a pretty good way to go
about this.

Thanks for the reference.

Looks like the existing research on this just hasn't made it into the
mainstream yet.

Here's their current project: http://www.language-observatory.org/
Looks like they are focusing more on language detection.

Another interesting paper using n-grams:
"Language Identification in Web Pages" by Bruno Martins and Mário J. Silva
http://xldb.fc.ul.pt/data/Publications_attach/ngram-article.pdf
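
FWIW, the basic character n-gram trick is easy to sketch in a few
lines of Python (this is the classic ranked trigram-profile approach,
not necessarily exactly what that paper implements):

from collections import Counter

def trigram_profile(text, size=300):
    # Rank the most frequent character trigrams of the text.
    counts = Counter(text[i:i+3] for i in range(len(text) - 2))
    return {gram: rank
            for rank, (gram, _) in enumerate(counts.most_common(size))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    # Sum of rank differences; trigrams missing from the language
    # profile get a fixed penalty rank.
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def guess_language(text, lang_profiles):
    doc = trigram_profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# lang_profiles would be built once from sample text per language:
# lang_profiles = {"en": trigram_profile(english_sample),
#                  "pt": trigram_profile(portuguese_sample)}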

And one using compression:
"Text Categorization Using Compression Models" by     
Eibe Frank, Chang Chui, Ian H. Witten
http://portal.acm.org/citation.cfm?id=789742
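
The compression approach is also easy to prototype; the paper uses
PPM models, but zlib gives you the flavor - the category whose
training text lets the new document compress with the fewest extra
bytes wins:

import zlib

def extra_bytes(corpus, document):
    # How many additional compressed bytes the document costs when
    # appended to the category's training corpus.
    return (len(zlib.compress(corpus + document, 9))
            - len(zlib.compress(corpus, 9)))

def classify(document, corpora):
    # corpora: category name -> training text (bytes)
    return min(corpora, key=lambda cat: extra_bytes(corpora[cat], document))

# corpora = {"en": english_training_bytes, "de": german_training_bytes}
# classify(b"Guten Tag, wie geht es Ihnen?", corpora)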

They're looking at "LSE"s, language-script-encoding
triples; a "script" is a way of using a particular character set to
write in a particular language.
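
In Python terms the unit of classification is just a
(language, script, encoding) triple, e.g. (made-up entries):

from collections import namedtuple

LSE = namedtuple("LSE", "language script encoding")

registered = {
    LSE("Japanese", "Kanji/Kana", "Shift_JIS"),
    LSE("Japanese", "Kanji/Kana", "EUC-JP"),
    LSE("Russian", "Cyrillic", "KOI8-R"),
    LSE("Russian", "Cyrillic", "UTF-8"),
}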

Their system has these requirements:

R1. the response must be either "correct answer" or "unable to detect"
    where "unable to detect" includes "other than registered" [the
    registered set of LSEs];

R2. applicable to multi-LSE texts;

R3. never accept a wrong answer, even when the program does not have
    enough data on an LSE; and

R4. applicable to any LSE text.

So, no wrong answers.
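
That "correct or nothing" contract boils down to refusing to answer
unless one candidate wins by a clear margin; roughly (this is a
sketch, not their actual decision rule):

UNABLE = "unable to detect"

def decide(scores, margin=0.2):
    # scores: LSE -> similarity in [0, 1]; refuse to answer unless the
    # best candidate clearly beats the runner-up.
    if not scores:
        return UNABLE
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best[1] - runner_up_score < margin:
        return UNABLE   # ambiguous: never guess (R1, R3)
    return best[0]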

The biggest disadvantage would seem to be that the registration data
for a particular LSE is kind of bulky: on the order of 10,000
shift-codons of three bytes each, about 30K uncompressed.

http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

For a server-based application that doesn't sound too large.

Unless you're covering a very broad range of languages, I don't
think you'd need more than a few hundred LSEs for a typical
application - nothing you'd want to put in the Python stdlib,
though.
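
Back-of-the-envelope, just scaling their per-LSE figure:

bytes_per_lse = 10000 * 3                # ~10,000 shift-codons of 3 bytes: ~30K
a_few_hundred = 300 * bytes_per_lse      # 9,000,000 bytes
print(a_few_hundred / 1024.0 / 1024.0)   # roughly 8.6 MB uncompressed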

Bill

IMHO, more research has to be done in this area before a
"standard" module can be added to Python's stdlib... and
who knows, perhaps we'll get lucky and by then everyone will
be using UTF-8 anyway :-)
I walked over to our computational linguistics group and asked.  This
is often combined with language guessing (which uses a similar
approach, but using characters instead of bytes), and apparently can
usually be done with high confidence.  Of course, they're usually
looking at clean texts, not random "stuff".  I'll see if I can get
some references and report back -- most of the research on this was
done in the 90's.

Bill
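
BTW, the characters-vs-bytes distinction above is really just about
what gets sliced into n-grams - decoded text for language guessing,
raw bytes for encoding guessing:

def ngrams(seq, n=3):
    # Works on any sliceable sequence: str for characters, bytes for raw bytes.
    return [seq[i:i+n] for i in range(len(seq) - n + 1)]

raw = "naïve".encode("utf-8")         # 6 bytes: b'na\xc3\xafve'
print(ngrams(raw))                     # byte trigrams, encoding-dependent
print(ngrams(raw.decode("utf-8")))     # character trigrams, encoding-independent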

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 22 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611