if you're using unicode (or don't mind converting the text to unicode) you
could determine the unicode block(s) of the text to guess the "ballpark"
language, tossing out anything where the majority of the text isn't "basic
latin". see http://www.sustainablegis.com/unicode/testUBlocks.cfm for an
example. after that you could simply check whether there are any non-english
chars (practically anything past \u007E). it's nowhere near foolproof, but it
is free & easy.
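
a rough sketch of that two-step check in python, just to show the idea (the
function name, the labels it returns and the "more than half" cutoff are my
own assumptions, not anything taken from the page linked above):

```python
def classify(text):
    """Crude two-step guess: Unicode block majority, then a non-English char check."""
    # Ignore whitespace so spacing does not skew the counts.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return "empty"

    # Step 1: count characters in the Basic Latin block (U+0000..U+007F).
    basic_latin = sum(1 for c in chars if ord(c) <= 0x7F)
    # Assumed cutoff: toss the text if more than half is outside Basic Latin.
    if basic_latin * 2 < len(chars):
        return "not-latin"

    # Step 2: anything past U+007E counts as a "non-english" character.
    if any(ord(c) > 0x7E for c in chars):
        return "latin-with-non-english-chars"

    return "probably-english"
```

so `classify("hello world")` comes back as "probably-english", while something
like "naïve café" gets flagged because of the accented letters. note that
plain-ascii text in other languages (e.g. unaccented dutch or indonesian)
would still pass, which is why this is only a ballpark guess.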

if you want near certainty then you'd need something like xerox's language
guesser:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html

or its unicode cousin:

http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser.en.html