We only support ASCII in our database, so we handle the paste-from-Word issue by identifying the most common non-ASCII Word characters and replacing them with ASCII equivalents:

/* Replace() returns the modified string, so assign the result back */
Local.Text = Replace(Local.Text, chr(8211), "-", "all");  /* short dash from MS Word */
Local.Text = Replace(Local.Text, chr(8212), "--", "all"); /* long dash from MS Word */
Local.Text = Replace(Local.Text, chr(8216), "'", "all");  /* left single quote from MS Word */
Local.Text = Replace(Local.Text, chr(8217), "'", "all");  /* right single quote from MS Word */
Local.Text = Replace(Local.Text, chr(8220), '"', "all");  /* left double quote from MS Word */
Local.Text = Replace(Local.Text, chr(8221), '"', "all");  /* right double quote from MS Word */

To find the chr() code for a Word character, we paste the character into a textarea, submit the form to the server, and then loop over the submitted characters, writing out each one with its code:

WriteOutput(myChar & ": " & asc(myChar));

Jon

On 9/15/06, Paul Hastings <[EMAIL PROTECTED]> wrote:
> Block, Jon wrote:
> > - copy/paste from microsoft word
>
> probably windows-1252 superset of latin-1 which often confuses people.
>
> > - XML export from InDesign UTF-16
>
> to simplify things it would be best to try to get utf-8 out of this
> thing. or at least which "endian" it is.
>
> > - XML export from Quark
>
> from what i remember, these yahoos refused to support unicode. i guess
> they might have changed since then (4-5 years ago).
>
> > In all 3 of the cases I've described above, the origin software is
> > putting through characters that do not display correctly on the web.
>
> depends on the encoding of your web application, database, etc. AND how
> stable the encodings are coming out of your 3 data sources (plus
> whatever hijinks the user OS gets up to when they do copy & paste).
>
> > The problem I'm having is with some of the characters, such as an
> > ellipsis mark or hyphen. When I run into these characters, they display
> > as the wrong character... sometimes a question mark. Other times a square
> > box... yet other times sequences of characters that are just totally
> > crazy.
>
> a box is simply that the browser can't render that char using the given
> font family, usually easily fixed. a question mark is another, sad, story.
> it means the data is garbled, usually from mixing up encodings.
>
> > My basic understanding of character encoding tells me that I want to
> > reduce all of the characters down to ASCII. I do not know a good way to
> > do this.
>
> why? you will eventually lose information & perhaps context of some
> text. it would be better to normalize your encodings on unicode (utf-8).
> fits w/cf, & easy enough to use/understand.
>
> > How can I accept text from each of the above mentioned sources, perhaps
> > others, and somehow *normalize* all of the character data into a set of
> > characters that will display properly on my page every time?
>
> converting to unicode, there are plenty of tools around for this. the
> next release of icu4j is supposed to have some heavy duty tools (there
> is a fairly good charset detector in the current version, but it's not
> foolproof, but then again hardly anything is when it comes to this).
> quark export is probably going to be the most fun to handle.
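
For reference, the character-inspection loop described at the top of this message could be sketched roughly as follows. This is only an illustration, not our production code: it assumes the pasted text arrives in a form field named Form.PastedText (a hypothetical name), and it reports only the characters that fall outside 7-bit ASCII.

```cfml
<cfscript>
// Form.PastedText is a hypothetical field name used only for illustration.
pastedText = Form.PastedText;

// Walk the submitted string one character at a time.
for (i = 1; i LTE Len(pastedText); i = i + 1) {
    myChar = Mid(pastedText, i, 1);

    // Report only characters outside the 7-bit ASCII range (0-127).
    if (Asc(myChar) GT 127) {
        WriteOutput(HTMLEditFormat(myChar) & ": " & Asc(myChar) & "<br>");
    }
}
</cfscript>
```

Any code that shows up in this output and is not yet covered by the replacement list is a candidate for another Replace() line.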

