At 17:45 -0800 on 01/09/2014, Charles Mills wrote about Re: Subject
Unicode:
You could use 8 bits for most characters, with cleverness that
expanded that out to two or three bytes for more obscure characters.
Pretty efficient, and you could make the first part of the character
set the same as ASCII, which would make it intuitive for PC folks
who "know" that A is X'41'. That is called UTF-8, and it's pretty
good and pretty popular as a result. Most Web pages are in UTF-8 and
I believe this e-mail came to you in UTF-8.
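Note that this "ASCII" is "US-ASCII", which is codepoints x00 to x7f.
UTF-8 encodes each US-ASCII codepoint as the same single byte. Any
Unicode codepoint from x80 to xff (these match ISO-8859-1, which along
with Windows-1252 is the normal mapping used for email and accented
characters/etc) is encoded as 2 bytes: a lead byte of xc2 or xc3
followed by a continuation byte from x80 to xbf, so the high half of
each byte is a x8 to xf nibble. Windows-1252 is ISO-8859-1 from xa0 to
xff, with the useless ISO-8859-1 x80 to x9f control codepoints replaced
by extra useful glyphs such as curved quotes and the euro symbol. Those
extras do not sit at x80 to x9f in Unicode: the curved quotes and the
euro have codepoints above x7ff and take 3 bytes in UTF-8.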
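If you want to check the byte counts yourself, here is a minimal
Python sketch. The helper name utf8_hex is made up for this example;
it relies only on the standard str.encode and bytes.decode methods and
the built-in "cp1252" codec (Python's name for Windows-1252).

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

# U+0041 (US-ASCII), U+00E9 (ISO-8859-1 range), U+20AC (euro sign)
for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# Byte x80 in Windows-1252 is the euro sign, which Unicode puts at
# U+20AC, so it needs 3 bytes in UTF-8 rather than 2.
euro = b"\x80".decode("cp1252")
print(f"cp1252 x80 -> U+{ord(euro):04X} -> {utf8_hex(euro)}")
```

The A comes out as the single byte 41, the accented e as c3 a9, and
the euro as e2 82 ac.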
For more info (and the gruesome details <g>), look at
https://en.wikipedia.org/wiki/UTF8.