Block, Jon wrote:
> - copy/paste from microsoft word

probably windows-1252, a superset of latin-1's printable characters (it puts curly quotes, dashes and the ellipsis in the 0x80-0x9F range), which often confuses people.

> - XML export from InDesign

UTF-16. to simplify things it would be best to try to get utf-8 out of this thing, or at least find out which "endian" it is.

> - XML export from Quark

from what i remember, these yahoos refused to support unicode. i guess they might have changed since then (4-5 years ago).

> In all 3 of the cases I've described above, the origin software is
> putting through characters that do not display correctly on the web.

depends on the encoding of your web application, database, etc. AND how stable the encodings are coming out of your 3 data sources (plus whatever hijinks the user's OS gets up to when they do copy & paste).

> The problem I'm having is with some of the characters, such as an
> ellipsis mark or hyphen. When I run into these characters, they display
> as the wrong character... sometimes a question mark. Other times a square
> box... yet other times sequences of characters that are just totally
> crazy.

a box simply means the browser can't render that char using the given font family, usually easily fixed. a question mark is another, sadder story: it means the data is garbaged, usually from mixing up encodings.

> My basic understanding of character encoding tells me that I want to
> reduce all of the characters down to ASCII. I do not know a good way to
> do this.

why? you will eventually lose information & perhaps the context of some text. it would be better to normalize your encodings on unicode (utf-8). fits w/cf, & easy enough to use/understand.

> How can I accept text from each of the above mentioned sources, perhaps
> others, and somehow *normalize* all of the character data into a set of
> characters that will display properly on my page every time?

by converting to unicode. there are plenty of tools around for this. the next release of icu4j is supposed to have some heavy duty tools (there is a fairly good charset detector in the current version, but it's not foolproof, then again hardly anything is when it comes to this).

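to make the symptoms above concrete, here is a small python sketch (python only because it's compact, the same thing applies in cf/java). the sample bytes are made up for illustration: they are what a word copy/paste would typically hand you for an ellipsis and an en dash, assuming the clipboard text really is windows-1252.

```python
# Ellipsis (0x85) and en dash (0x96) as windows-1252 bytes; sample data is hypothetical.
word_bytes = b"wait\x85 a \x96 b"

# Correct decode: 0x85 becomes U+2026 (ellipsis), 0x96 becomes U+2013 (en dash).
text = word_bytes.decode("windows-1252")

# Treating the same bytes as latin-1 yields invisible C1 control characters instead.
wrong_latin1 = word_bytes.decode("latin-1")

# The "totally crazy" sequences: good UTF-8 bytes misread as windows-1252.
mojibake = text.encode("utf-8").decode("windows-1252")

# The question marks: forcing the text down to ASCII throws the characters away.
ascii_lossy = text.encode("ascii", errors="replace").decode("ascii")

print(repr(text))
print(repr(wrong_latin1))
print(repr(mojibake))
print(repr(ascii_lossy))
```

the last line is the argument against reducing everything to ASCII: once the ellipsis has become a "?", there is no way to get it back.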
quark export is probably going to be the most fun to handle.
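for a source that doesn't tell you its encoding, a rough normalization step might look like the sketch below. this is not icu4j's detector, just a simple fallback chain: trust a byte order mark if there is one, then try strict utf-8, then assume windows-1252. the function name `to_unicode` and the `fallback` parameter are made up for this example, and it ignores UTF-32 and BOM-less UTF-16 entirely.

```python
import codecs


def to_unicode(raw: bytes, fallback: str = "windows-1252") -> str:
    """Decode bytes of unknown encoding to text (hypothetical helper, not foolproof)."""
    # A byte order mark, when present, settles the encoding and the endianness.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic utf-16 codec reads the BOM, picks the endianness, and strips it.
        return raw.decode("utf-16")
    try:
        # Strict UTF-8 rejects most non-UTF-8 input, so success is a strong hint.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything else is legacy Windows text; undefined bytes become U+FFFD.
        return raw.decode(fallback, errors="replace")


# Normalize everything to UTF-8 bytes before it goes into the database or page.
utf8_bytes = to_unicode(b"wait\x85").encode("utf-8")
```

the fallback is a guess, so it's worth logging whenever that last branch is taken and checking those records by eye.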

