Re: Embedded two-byte representations of marked alphabetic characters in SBCSs

Bill Godfrey Sat, 05 Oct 2013 14:09:25 -0700

On Sat, 5 Oct 2013 15:12:09 -0400, John Gilmore wrote:

>Consider the botched French-language text
>
>A MontrÃ©al, a la fin des annÃ©es 80, . . .
>
>It should of course be
>
>A Montréal, a la fin des années 80 . . .
>
>The difficulty arises when a convention for representing 'é'  as two
>successive byte values of the form
>
><-minuscule-e code point><accent-aigu code point>
>
>in one code page collides with the single-byte representation of 'Ã'
>and '©' as just these two unique code points in another code page.
>
>Regrettably, Unicode has carried alternative support for the generic
>
><basic alphabetic code point><modifier code point>
>
>scheme forward; and its availability and heavy use in some contexts
>needs to figure in the sorts of discussions that have been going on
>here during the last few days.   It greatly complicates translation in
>a fashion that is of no conceptual interest but is messy.
>
The pair of characters in the collision you chose for your example is the
UTF-8 encoding of 'é'. The UTF-8 representation of 'é' (e accent-aigu) is
formed by taking the 8 bits that represent that character in ISO 8859-1 and
embedding them across 2 bytes in the bit pattern 110000xx 10xxxxxx, just as
all the other ISO 8859-1 characters from hex A0 to hex FF are represented in
UTF-8. Hex E9 becomes hex C3 A9. How can this be described as a convention
that uses the form <basic alphabetic code point><modifier code point>?
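
To make the bit arithmetic concrete, here is a minimal Python sketch. The
helper name utf8_two_byte is only illustrative, not any library's API. It
builds the two bytes by hand, shows that reading them as ISO 8859-1 yields
the two characters from the botched text, and contrasts that with the
genuine <base><modifier> form (e followed by U+0301), which is a different
byte sequence altogether.

```python
import unicodedata


def utf8_two_byte(code_point: int) -> bytes:
    """Hand-encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx.

    Illustrative helper only; real code would call str.encode("utf-8").
    """
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers U+0080..U+07FF only")
    lead = 0b11000000 | (code_point >> 6)            # upper bits of the code point
    trail = 0b10000000 | (code_point & 0b00111111)   # low six bits
    return bytes([lead, trail])


e_acute = "\u00e9"                      # 'é', hex E9 in ISO 8859-1
encoded = utf8_two_byte(ord(e_acute))   # hex C3 A9

# The same two bytes read as ISO 8859-1 give the two characters seen in
# the botched text.
mojibake = encoded.decode("latin-1")

# The base-plus-modifier form is a separate thing: 'e' followed by
# U+0301 COMBINING ACUTE ACCENT, obtained here by NFD normalization.
decomposed = unicodedata.normalize("NFD", e_acute)
decomposed_bytes = decomposed.encode("utf-8")

print(encoded.hex(), repr(mojibake), decomposed_bytes.hex())
```

The precomposed character and the combining sequence never share a byte
representation in UTF-8, so the C3 A9 pair cannot be an instance of the
modifier scheme.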


Bill

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
