The 2002 paper "A language and character set determination method based on N-gram statistics" by Izumi Suzuki, Yoshiki Mikami, Ario Ohsato, and Yoshihide Chubachi seems to me a pretty good way to go about this. They're looking at "LSE"s, language-script-encoding triples; a "script" here is a way of using a particular character set to write in a particular language.
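To make the idea concrete before getting into their requirements, here's a rough sketch of what N-gram-based LSE detection might look like in Python. The function names, the fixed coverage threshold, and the use of raw byte trigrams in place of the paper's actual "shift-codons" are my own simplifications for illustration, not the paper's algorithm:

    from collections import Counter

    def byte_trigrams(data):
        """Count overlapping three-byte sequences in a byte string."""
        return Counter(data[i:i + 3] for i in range(len(data) - 2))

    def register_lse(samples, top_n=10000):
        """Build a registration profile for one LSE: the top_n most
        frequent trigrams seen in sample text known to be in that LSE."""
        counts = Counter()
        for sample in samples:
            counts.update(byte_trigrams(sample))
        return set(gram for gram, _ in counts.most_common(top_n))

    def detect(text, profiles, threshold=0.9):
        """Return the name of the one registered LSE whose profile covers
        at least `threshold` of the text's trigrams, or "unable to detect"
        otherwise -- never a guess."""
        grams = byte_trigrams(text)
        total = sum(grams.values())
        if total == 0:
            return "unable to detect"
        matches = []
        for name, profile in profiles.items():
            covered = sum(n for gram, n in grams.items() if gram in profile)
            if covered / float(total) >= threshold:
                matches.append(name)
        # Accept only an unambiguous single match; anything else falls
        # back to "unable to detect" rather than a wrong answer.
        return matches[0] if len(matches) == 1 else "unable to detect"

You'd call detect() with something like profiles = {"ja-Kanji-EUC-JP": register_lse(euc_samples), ...} for whatever LSEs you've registered. A 10,000-entry profile of three-byte grams is roughly where the 30K figure below comes from; the real registration and matching procedure in the paper is more involved than this, of course.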
Their system has these requirements:

  R1. the response must be either "correct answer" or "unable to detect",
      where "unable to detect" includes "other than registered" [i.e. not
      in the registered set of LSEs];
  R2. applicable to multi-LSE texts;
  R3. never accept a wrong answer, even when the program does not have
      enough data on an LSE; and
  R4. applicable to any LSE text.

So, no wrong answers. The biggest disadvantage would seem to be that the registration data for a particular LSE is kind of bulky: on the order of 10,000 shift-codons of three bytes each, about 30K uncompressed.

http://portal.acm.org/ft_gateway.cfm?id=772759&type=pdf

Bill

> > IMHO, more research has to be done into this area before a
> > "standard" module can be added to Python's stdlib... and
> > who knows, perhaps we're lucky and by that time everyone is
> > using UTF-8 anyway :-)
>
> I walked over to our computational linguistics group and asked. This
> is often combined with language guessing (which uses a similar
> approach, but using characters instead of bytes), and apparently can
> usually be done with high confidence. Of course, they're usually
> looking at clean texts, not random "stuff". I'll see if I can get
> some references and report back -- most of the research on this was
> done in the 90's.
>
> Bill