first let me start off by saying that detecting *language* from text samples is
very, very difficult. at best it's still going to be a guess. as a general rule,
the better the guess, the more costly it will be all round.
Pablo Varando wrote:
> <script language="English">
not a js expert but i think that attribute refers to the scripting language
(javascript, vbscript, etc.), not the human language of the text.
> if(txt.value.charCodeAt(i) > 256){
depending on the form's encoding this isn't going to guarantee *english*. for
the "latin" unicode subrange, that codepoint range includes chars used in
french, german, etc. for instance "Qu'est ce qu'Unicode?" is french but would
pass your test. getting kind of silly:
Mi povas manĝi vitron, ĝi ne damaĝas min.
would mostly pass your test but it's obviously not english (it's Esperanto).
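
to make that concrete, here's a minimal sketch of the quoted check pulled out into standalone functions. the names passesQuotedTest and countRejected are made up for illustration (they're not from the original script), and the german sample is an extra example of mine. it only shows that latin-script text in other languages gets through:

```javascript
// Hypothetical helper mirroring the quoted check:
// true when no UTF-16 code unit in the text is above 256.
function passesQuotedTest(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 256) {
      return false;
    }
  }
  return true;
}

// Hypothetical helper: how many code units the quoted check would reject.
function countRejected(text) {
  let rejected = 0;
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 256) {
      rejected++;
    }
  }
  return rejected;
}

const french = "Qu'est ce qu'Unicode?";
const german = "Gr\u00FC\u00DFe aus K\u00F6ln"; // "Grüße aus Köln"
// \u011D is the Esperanto letter g-circumflex
const esperanto = "Mi povas man\u011Di vitron, \u011Di ne dama\u011Das min.";

console.log(passesQuotedTest(french));  // french slips through
console.log(passesQuotedTest(german));  // so does german
console.log(countRejected(esperanto));  // only the g-circumflex chars are caught
```

so the check really tests "is this char in the first 256 codepoints", which is a statement about script, not about language.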
if a windows codepage is used, then it could be practically any language. take
thai (windows-874) for instance: 0-127 are "english" chars, while 128-255 are
almost all *thai* chars.
also for unicode that codepoint range will exclude "valid" currency,
punctuation, etc. symbols.
i guess you might try using unicode & adapting the uBlocks CFC (see
http://www.sustainablegis.com/unicode/testUBlocks.cfm). one approach would be
to examine the browser's language settings, specifically exclude the
"non-latin" subranges & then subset out chars that are only used in french,
etc. from what's left.
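
a rough client-side sketch of that idea follows. this is *not* the uBlocks CFC, just an illustration under my own assumptions: the allowed block list, the function names (nonEnglishChars, guessEnglish) and the "starts with en" language check are all made up for the example. in a browser you'd pass something like navigator.language as the second argument.

```javascript
// Assumed allow-list of Unicode ranges for "english-looking" text.
// Accented Latin-1 letters (U+00C0-U+00FF) are deliberately left out.
const ALLOWED_RANGES = [
  [0x0000, 0x007F], // Basic Latin
  [0x00A0, 0x00BF], // Latin-1 punctuation and symbols (pound, copyright, ...)
  [0x2000, 0x206F], // General Punctuation (curly quotes, dashes, ...)
  [0x20A0, 0x20CF], // Currency Symbols (euro, ...)
];

// Hypothetical helper: returns the chars that fall outside the allowed ranges.
function nonEnglishChars(text) {
  const found = [];
  for (const ch of text) { // for...of walks code points, not UTF-16 units
    const cp = ch.codePointAt(0);
    const allowed = ALLOWED_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
    if (!allowed) {
      found.push(ch);
    }
  }
  return found;
}

// Hypothetical guess: the browser says english AND no disallowed chars were found.
function guessEnglish(text, browserLang) {
  const langIsEnglish = /^en(-|$)/i.test(browserLang || "");
  return langIsEnglish && nonEnglishChars(text).length === 0;
}

console.log(guessEnglish("It costs \u20AC5, isn\u2019t it?", "en-US"));
console.log(nonEnglishChars("Gr\u00FC\u00DFe").join(""));
console.log(guessEnglish("Qu'est ce qu'Unicode?", "en-GB"));
```

note the last call: unaccented french typed in a browser set to english still gets guessed as english, and english words like "café" get flagged, so this stays a guess.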
you can come closer using a language guesser but this would be server side &
involve 3rd party libs (icu4j has a charset detector that works fairly well but
it detects the *charset*, not the language).
Archive:
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:265592