Indeed, though some software does it well. Our search engine (FAST) does it
very well, though as you would expect, it costs!

-----Original Message-----
From: Paul Hastings
To: CF-Talk
Sent: Thu Jan 04 03:48:04 2007
Subject: Re: English Characters (ONLY) on form fields!

first let me start off by saying detecting *language* from text samples is
very, very difficult. at best it's still going to be a guess. a general rule
is: the better the guess, the more costly it will be all round.

Pablo Varando wrote:
> <script language="English">

i'm not a js expert, but i think that attribute refers to the scripting
language (javascript, etc.), not a human language.

>           if(txt.value.charCodeAt(i) > 256){

depending on the form's encoding this isn't going to guarantee *english*.
for the "latin" unicode subrange, that codepoint range includes chars used
in french, german, etc. for instance "Qu'est ce qu'Unicode?" is french but
would pass your test. getting kind of silly:

Mi povas manĝi vitron, ĝi ne damaĝas min.

would mostly pass your test but it's obviously not english (it's Esperanto).
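
to make that concrete, here's a minimal sketch that re-creates the quoted
check and runs both sentences through it. rejectedChars is a hypothetical
helper name of mine, not something from the original form code, and the
Esperanto letter is written as the escape \u011D so the sample doesn't depend
on how this mail gets encoded:

```javascript
// Re-creation of the quoted test: a character is rejected only when its
// UTF-16 code unit is greater than 256.
// (rejectedChars is a hypothetical helper, not part of the original form.)
function rejectedChars(text) {
  const rejected = [];
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 256) {
      rejected.push(text.charAt(i));
    }
  }
  return rejected;
}

const french = "Qu'est ce qu'Unicode?";
const esperanto = "Mi povas man\u011Di vitron, \u011Di ne dama\u011Das min.";

console.log(rejectedChars(french));    // nothing is rejected
console.log(rejectedChars(esperanto)); // only the U+011D letters are rejected
```

so the french sentence sails through untouched, and the only thing the check
catches in the Esperanto one is a single accented letter. neither result says
anything about the language.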

if a windows codepage is used, then it could be practically any language.
take thai (windows-874) for instance: 0-127 are "english" chars, 128-255 are
almost entirely *thai* chars.

also for unicode that codepoint range will exclude "valid" currency, 
punctuation, etc. symbols.
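
a quick sketch of that side effect, assuming the same "> 256" test as in the
quoted code. failsQuotedTest and the sample list are hypothetical names of
mine, picked only for illustration:

```javascript
// Same threshold as the quoted check: anything above 256 is refused.
// (failsQuotedTest and samples are hypothetical, for illustration only.)
function failsQuotedTest(ch) {
  return ch.charCodeAt(0) > 256;
}

const samples = [
  ["euro sign", "\u20AC"],
  ["right single quotation mark", "\u2019"],
  ["horizontal ellipsis", "\u2026"],
  ["pound sign", "\u00A3"],
];

for (const [name, ch] of samples) {
  const hex = ch.charCodeAt(0).toString(16).toUpperCase();
  console.log(name + " U+" + hex + " rejected: " + failsQuotedTest(ch));
}
```

the euro sign, the curly apostrophe and the ellipsis all sit well above 256,
so a perfectly english "it’s €5…" gets refused, while the pound sign (U+00A3)
gets through.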

i guess you might try using unicode & adapting the uBlocks CFC (see
http://www.sustainablegis.com/unicode/testUBlocks.cfm). one approach would
be to examine the browser's language settings & specifically exclude those
"non-latin" subranges & subset out chars that are only used in french, etc.
from what's left.
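
a rough client-side sketch of the subrange idea. this is *not* the uBlocks
CFC's API, just a hypothetical javascript analogue: ALLOWED_BLOCKS,
ALLOWED_EXTRAS, isAllowed and disallowedChars are all names i made up, and
which blocks & extra chars to allow is an assumption you'd tune yourself. it
also leaves out the browser language settings part:

```javascript
// Hypothetical whitelist of Unicode blocks (ranges are the standard block
// boundaries; the choice of blocks is an assumption for illustration).
const ALLOWED_BLOCKS = [
  { name: "Basic Latin (printable)", from: 0x0020, to: 0x007e },
  { name: "General Punctuation", from: 0x2000, to: 0x206f },
  { name: "Currency Symbols", from: 0x20a0, to: 0x20cf },
];

// Individual Latin-1 characters allowed on top of the blocks above,
// instead of admitting the whole Latin-1 Supplement with its accented letters.
const ALLOWED_EXTRAS = new Set([0x00a3 /* pound sign */, 0x00a9 /* copyright */]);

function isAllowed(codePoint) {
  if (ALLOWED_EXTRAS.has(codePoint)) {
    return true;
  }
  return ALLOWED_BLOCKS.some((b) => codePoint >= b.from && codePoint <= b.to);
}

// Returns every character of text that falls outside the whitelist.
function disallowedChars(text) {
  const bad = [];
  for (const ch of text) { // for...of walks code points, not UTF-16 units
    if (!isAllowed(ch.codePointAt(0))) {
      bad.push(ch);
    }
  }
  return bad;
}

console.log(disallowedChars("Price: \u20AC5, \u00A35")); // currency now accepted
console.log(disallowedChars("caf\u00E9"));               // accented letter flagged
console.log(disallowedChars("Qu'est ce qu'Unicode?"));   // french still passes
```

note the last line: this only narrows the *characters* you accept, it's still
not detecting language. it also rejects tabs and newlines as written, so widen
the first range if you use it on a textarea.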

you can come closer using a language guesser but this would be server side &
involve 3rd party libs (icu4j has a charset detector that works fairly well
but it's *not* language).



Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:265604