On Sat, 5 Oct 2013 15:12:09 -0400, John Gilmore wrote:

>Consider the botched French-language text
>
>A MontrÃ©al, a la fin des annÃ©es 80, . . .
>
>It should of course be
>
>A Montréal, a la fin des années 80 . . .
>
>The difficulty arises when a convention for representing 'é' as two
>successive byte values of the form
>
><-minuscule-e code point><accent-aigu code point>
>
>in one code page collides with the single-byte representation of 'Ã'
>and '©' as just these two unique code points in another code page.
>
>Regrettably, Unicode has carried alternative support for the generic
>
><basic alphabetic code point><modifier code point>
>
>scheme forward; and its availability and heavy use in some contexts
>needs to figure in the sorts of discussions that have been going on
>here during the last few days.  It greatly complicates translation in
>a fashion that is of no conceptual interest but is messy.
>

The form represented by the pair of characters in the collision you
chose for your example is UTF-8.  The UTF-8 representation of 'é'
(e accent-aigu) is formed by taking the 8 bits that represent that
character in ISO 8859-1 and embedding them across 2 bytes in the bit
pattern 110000xx 10xxxxxx, just as all the other ISO 8859-1 characters
from hex A0 to hex FF are represented in UTF-8.  Hex E9 becomes hex
C3 A9.  Read back as ISO 8859-1, those two bytes are 'Ã' (hex C3) and
'©' (hex A9), which is exactly the pair in your botched text.  How can
this be described as a convention that uses the form
<basic alphabetic code point><modifier code point>?

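For anyone who wants to check the arithmetic, here is a minimal Python sketch of the bit packing described above. The function name latin1_byte_to_utf8 is made up for illustration and is not part of any library. It also shows, for contrast, what the <basic alphabetic code point><modifier code point> form of the same character looks like in UTF-8, assuming that form means Unicode's decomposed (NFD) 'e' followed by a combining acute accent.

```python
import unicodedata


def latin1_byte_to_utf8(b: int) -> bytes:
    """Pack an 8-bit value 0x80..0xFF into the pattern 110000xx 10xxxxxx.

    Hypothetical helper for illustration only.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("expected a value in 0x80..0xFF")
    lead = 0b11000000 | (b >> 6)            # top 2 bits go into the lead byte
    trail = 0b10000000 | (b & 0b00111111)   # low 6 bits go into the trailing byte
    return bytes([lead, trail])


e_acute = "\u00e9"                          # precomposed e accent-aigu

# Hex E9 becomes hex C3 A9, matching Python's own UTF-8 encoder.
utf8 = latin1_byte_to_utf8(0xE9)
print(utf8.hex())                           # c3a9
print(utf8 == e_acute.encode("utf-8"))      # True

# Reading those two bytes as ISO 8859-1 yields the two characters in the
# botched text.
mojibake = utf8.decode("latin-1")
print([unicodedata.name(c) for c in mojibake])

# The base-plus-modifier form is a different byte sequence altogether.
decomposed = unicodedata.normalize("NFD", e_acute)
print([unicodedata.name(c) for c in decomposed])
print(decomposed.encode("utf-8").hex())     # 65cc81
```

The two name lists should come out as LATIN CAPITAL LETTER A WITH TILDE plus COPYRIGHT SIGN, and LATIN SMALL LETTER E plus COMBINING ACUTE ACCENT. The decomposed form occupies three bytes (65 CC 81), so it cannot be what produced the two-character damage in the example.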
Bill
