On Wed, 14 Dec 2005, Bill Hacker wrote:

> Simply set 'UTF-8' in the meta-data of the webpage.
> 
> ISO-8859-1 is a (mostly) proper subset, but not the reverse.

That isn't strictly true. Confusion arises because UTF-8 is not, 
strictly, a character set: it does not itself assign numbers to 
characters. It is a variable-length way of encoding a sequence of 
numbers, whose values need up to 21 bits to represent in binary, into 
a string of 8-bit bytes. The first 128 numbers are represented by 
single bytes; larger values take two to four bytes.
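
To make the "numbers into bytes" point concrete, here is a rough sketch 
in Python of the bit-shuffling involved. The function name utf8_encode 
is made up for illustration, and it assumes the upper limit of 0x10FFFF 
that Unicode places on code points; it is not how any particular 
library does it.

```python
def utf8_encode(value):
    """Encode one integer in 0..0x10FFFF as UTF-8 bytes, by hand.

    Illustrative only: unlike a real codec, this does not reject the
    surrogate range D800-DFFF.
    """
    if value < 0 or value > 0x10FFFF:
        raise ValueError("value out of range for UTF-8")
    if value < 0x80:
        # Up to 7 bits: a single byte, identical to ASCII.
        return bytes([value])
    if value < 0x800:
        # Up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (value >> 6),
                      0x80 | (value & 0x3F)])
    if value < 0x10000:
        # Up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    # Up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])


print(utf8_encode(0x41).hex())  # 41
print(utf8_encode(0xF7).hex())  # c3b7
```

Nothing in there knows what character a number stands for, which is 
the sense in which UTF-8 only encodes values.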

Unicode is the character set: it defines character code points, 
values up to 0x10FFFF (21 bits), though most characters in everyday 
use are within the 16-bit limit. Unicode is often represented using 
the UTF-8 value encoding, but not always. Some applications use 
straight 16-bit values (UCS-2 or UTF-16). However, in the context of 
many applications, including, it seems, the web, the name "UTF-8" has 
become synonymous with "Unicode, encoded as UTF-8".

ISO-8859-1 code values are a subset of Unicode code values. However, 
ISO-8859-1 code values are always represented as single bytes. This 
means that values 0-127 are indeed identical to the UTF-8 values 0-127. 
However, the remaining ISO-8859-1 code points (128-255), though they 
encode the same characters as Unicode, are not represented in the same 
way. In ISO-8859-1 these values are single bytes; in Unicode/UTF-8 they
require two bytes. Take, for example, the character whose Unicode and 
ISO-8859-1 code point is 00F7 (the division sign). In ISO-8859-1 this
would be the single byte with hex value F7; in UTF-8 this value is coded
as two bytes C3, B7.
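
The same comparison can be run over the whole ISO-8859-1 range. This 
is only a sketch using Python's built-in codecs; the helper name 
both_encodings is hypothetical.

```python
def both_encodings(code_point):
    """Return (ISO-8859-1 bytes, UTF-8 bytes) for one code point 0..255."""
    ch = chr(code_point)
    return ch.encode("iso-8859-1"), ch.encode("utf-8")


same = []       # code points whose bytes are identical in both encodings
different = []  # code points whose bytes differ

for cp in range(256):
    latin1, utf8 = both_encodings(cp)
    (same if latin1 == utf8 else different).append(cp)

print(min(same), max(same))            # 0 127
print(min(different), max(different))  # 128 255

latin1, utf8 = both_encodings(0xF7)
print(latin1.hex(), utf8.hex())        # f7 c3b7
```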

Therefore, if you have a file that contains ISO-8859-1 text and it 
contains characters in the range 128-255, you cannot just pretend that 
it is UTF-8. In fact, it will most probably be invalid as a UTF-8 file 
because the bytes with the top bit set won't, in general, form valid 
UTF-8 sequences. Some of them, though (e.g. the sequence C3, B7), will 
be valid as UTF-8 and will silently decode to a different character 
from the one intended. So you will get a mess.
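
Both failure modes are easy to reproduce. A small sketch, again with 
Python's built-in codecs; try_utf8 is a made-up helper and the sample 
strings are invented for the purpose.

```python
def try_utf8(data):
    """Decode bytes as UTF-8, returning None if they are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# ISO-8859-1 text containing the division sign: the lone byte F7 is
# not a valid UTF-8 sequence, so decoding fails outright.
broken = "10 \u00f7 2".encode("iso-8859-1")   # b"10 \xf7 2"
print(try_utf8(broken))                       # None

# ISO-8859-1 text that happens to contain the byte pair C3, B7
# (A-tilde followed by middle dot): this is valid UTF-8, but it
# decodes to a single, different character, the division sign.
accidental = "\u00c3\u00b7".encode("iso-8859-1")   # b"\xc3\xb7"
print(try_utf8(accidental) == "\u00f7")            # True
```

The second case is the nastier one, since nothing reports an error.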

-- 
Philip Hazel            University of Cambridge Computing Service,
[EMAIL PROTECTED]      Cambridge, England. Phone: +44 1223 334714.
Get the Exim 4 book:    http://www.uit.co.uk/exim-book

