Re: Subject Unicode

Robert A. Rosenberg Thu, 09 Jan 2014 20:28:09 -0800

At 17:45 -0800 on 01/09/2014, Charles Mills wrote about Re: Subject Unicode:

You could use 8 bits for most characters, with cleverness that expanded that out to two or three bytes for more obscure characters. Pretty efficient, and you could make the first part of the character set the same as ASCII, which would make it intuitive for PC folks who "know" that A is X'41'. That is called UTF-8, and it's pretty good and pretty popular as a result. Most Web pages are in UTF-8 and I believe this e-mail came to you in UTF-8.

Note that that "ASCII" is "US-ASCII" and is codepoints x00 to x7f. UTF-8 maps US-ASCII to its single byte codepoint. Any codepoint from x80 to xff (from ISO-8859-1 or Windows-1252 [which is ISO-8859-1 from xa0 to xff with the useless ISO-8859-1 x80 to x9f codepoints replaced with 27 extra useful glyphs such as curved quotes and the euro symbol], which is the normal mapping used for email and accented characters, etc.) is mapped as 2 bytes (the high half of each byte is an x8 to xf nibble).
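A quick way to check the byte counts is to encode a few characters and look at the hex. The sketch below is illustrative only: Python and the helper name utf8_hex are my own choices, not anything from the thread. It also shows one wrinkle worth knowing: the Windows-1252 glyphs at x80 to x9f are assigned Unicode codepoints above xff (the euro symbol is U+20AC), so once converted to Unicode they come out as 3 bytes in UTF-8 rather than 2.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


# US-ASCII codepoint: one byte, same value as in ASCII.
print("A      ->", utf8_hex("A"))

# ISO-8859-1 codepoint xe9 (e with acute accent): two bytes in UTF-8.
print("U+00E9 ->", utf8_hex("\u00e9"))

# Windows-1252 byte x80 is the euro symbol, which Unicode places at U+20AC.
euro = bytes([0x80]).decode("cp1252")
print(f"U+{ord(euro):04X} ->", utf8_hex(euro))

# Every codepoint from x80 to xff takes exactly two bytes, and the high
# nibble of each of those bytes is in the range x8 to xf.
two_byte = all(len(chr(cp).encode("utf-8")) == 2 for cp in range(0x80, 0x100))
high_nibbles = all(
    (b >> 4) >= 0x8 for cp in range(0x80, 0x100) for b in chr(cp).encode("utf-8")
)
print("x80..xff all two bytes:", two_byte)
print("high nibble always x8..xf:", high_nibbles)
```

The first byte of a two-byte sequence starts with the bits 110 and the second with 10, which is why both high nibbles land in the x8 to xf range.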

For more info (and the gruesome details <g>), look at https://en.wikipedia.org/wiki/UTF8.


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
