could determine the unicode block(s) of the text to guess the "ballpark"
language, tossing out anything where the majority of the text isn't "basic
latin". see http://www.sustainablegis.com/unicode/testUBlocks.cfm for an
example. after that you could simply check whether there are any non-english
chars (practically anything past \u007E). it's nowhere near foolproof but it
is free & easy.
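
a rough sketch of that two-step check, in python just for illustration. the function names, the labels and the 50% "majority" threshold are my own assumptions, not taken from the page linked above, and it only tells basic latin apart from everything else rather than naming each unicode block:

```python
def basic_latin_ratio(text):
    """Fraction of non-whitespace chars inside Basic Latin (U+0000..U+007F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ord(c) <= 0x7F) / len(chars)


def classify(text, threshold=0.5):
    """Hypothetical helper: crude 'ballpark' guess, not real language detection."""
    if not text.strip():
        return "unknown"
    # step 1: toss anything where the majority of the text is not basic latin
    if basic_latin_ratio(text) <= threshold:
        return "not basic latin"
    # step 2: anything past \u007E is treated as a non-english char
    if any(ord(c) > 0x7E for c in text):
        return "latin, non-english chars"
    return "probably english"


if __name__ == "__main__":
    for sample in ("hello world", "café au lait", "Привет мир"):
        print(repr(sample), "->", classify(sample))
```

keep in mind this can't tell english from any other language that happens to be written in plain ascii, so "probably english" really means "nothing here rules english out".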
if you want near certainty then you'd need something like xerox's language
guesser:
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html
or its unicode cousin:
http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser.en.html

