Re: Does anybody "really" understand character encodings?

Jon Gunnip Fri, 15 Sep 2006 09:27:13 -0700

We only support ASCII in our database, so we handle the
paste-from-Word issue by identifying the most common non-ASCII Word
characters and replacing them with ASCII equivalents:


/* Replace() returns the modified string, so assign the result back */
Local.Text = Replace(Local.Text, chr(8211), "-", "all");  /* en dash (short dash) from MS Word */
Local.Text = Replace(Local.Text, chr(8212), "--", "all"); /* em dash (long dash) from MS Word */
Local.Text = Replace(Local.Text, chr(8216), "'", "all");  /* left single quote from MS Word */
Local.Text = Replace(Local.Text, chr(8217), "'", "all");  /* right single quote from MS Word */
Local.Text = Replace(Local.Text, chr(8220), '"', "all");  /* left double quote from MS Word */
Local.Text = Replace(Local.Text, chr(8221), '"', "all");  /* right double quote from MS Word */
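
As a rough sketch of the same idea in table form (Python here purely
for illustration, not our ColdFusion code; the names WORD_TO_ASCII and
word_to_ascii are made up for this example), the replacements can be
kept in one mapping from code point to ASCII substitute:

```python
# Hypothetical lookup table: decimal code point -> ASCII replacement.
# The six entries mirror the chr() values listed above.
WORD_TO_ASCII = {
    8211: "-",   # en dash (short dash)
    8212: "--",  # em dash (long dash)
    8216: "'",   # left single quote
    8217: "'",   # right single quote
    8220: '"',   # left double quote
    8221: '"',   # right double quote
}


def word_to_ascii(text):
    """Replace the mapped Word characters; everything else is left as-is."""
    return text.translate(WORD_TO_ASCII)
```

Anything not in the table (an ellipsis, for example) passes through
unchanged, so the table only covers the characters you have actually
identified.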

We identify the Word chr() equivalent by pasting the Word characters
into a textarea, submitting to the server, and then looping over the
characters: WriteOutput(myChar & ": " & asc(myChar)).
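
A minimal sketch of that lookup loop, written in Python for
illustration rather than ColdFusion; ord() plays the role of asc(),
and the function name and sample string are made up for the example:

```python
def code_points(text):
    """Return (character, decimal code point) pairs for every character."""
    return [(ch, ord(ch)) for ch in text]


# Sample text as it might arrive from a Word paste (escaped so the
# source itself stays ASCII).
sample = "\u201cHello\u201d \u2013 it\u2019s"

for ch, cp in code_points(sample):
    if cp > 127:  # only report the non-ASCII characters
        print(f"{ascii(ch)}: {cp}")
```

The numbers it reports are the values to pass to chr() in the
replacement calls.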

Jon


On 9/15/06, Paul Hastings <[EMAIL PROTECTED]> wrote:
> Block, Jon wrote:
> > - copy/paste from microsoft word
>
> probably windows-1252, a superset of latin-1, which often confuses people.
>
> > - XML export from InDesign UTF-16
>
> to simplify things it would be best to try to get utf-8 out of this
> thing. or at least find out which "endian" it is.
>
> > - XML export from Quark
>
> from what i remember, these yahoos refused to support unicode. i guess they
> might have changed since then (4-5 years ago).
>
> > In all 3 of the cases I've described above, the origin software is
> > putting through characters that do not display correctly on the web.
>
> depends on the encoding of your web application, database, etc. AND how
> stable the encodings are coming out of your 3 data sources (plus
> whatever hijinks the user OS gets up to when they do copy & paste).
>
> > The problem I'm having is with some of the characters, such as an
> > ellipsis mark or hyphen. When I run into these characters, they display
> > as the wrong character... sometimes a question mark. Other times a square
> > box... yet other times sequences of characters that are just totally
> > crazy.
>
> a box simply means that the browser can't render that char using the given
> font family, usually easily fixed. a question mark is another, sad,
> story. it means the data is garbled, usually from mixing up encodings.
>
> > My basic understanding of character encoding tells me that I want to
> > reduce all of the characters down to ASCII. I do not know a good way to
> > do this.
>
> why? you will eventually lose information & perhaps context of some
> text. it would be better to normalize your encodings on unicode (utf-8).
> fits w/cf, & easy enough to use/understand.
>
> > How can I accept text from each of the above mentioned sources, perhaps
> > others, and somehow *normalize* all of the character data into a set of
> > characters that will display properly on my page every time?
>
> converting to unicode, there are plenty of tools around for this. the
> next release of icu4j is supposed to have some heavy duty tools (there
> is a fairly good charset detector in the current version, but it's not
> foolproof but then again hardly anything is when it comes to this).
> quark export is probably going to be the most fun to handle.


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:253274