RE: latin1 decoder implementation

Whistler, Ken Fri, 16 Nov 2012 14:34:57 -0800

The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC 
8859-1 (Latin-1), but you need to distinguish what happens for the graphic 
characters from what happens for the control codes.


ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF. 
Those are exactly identical to the Unicode characters at the respective code 
points.

ISO 8859-1 does *not* define control code usage for the ranges 0x00..0x1F, 
0x7F..0x9F. What that standard says is:

"The shaded positions in the code table [i.e. 0x00..0x1F, 0x7F..0x9F] 
correspond to bit combinations that do not represent graphic characters. Their 
use is outside the scope of ISO/IEC 8859; it is specified in other 
International Standards, for example ISO/IEC 6429."

What [almost] all character set conversions for ISO 8859 character encodings 
currently assume is that control code usage for the C0 set (0x00..0x1F, 0x7F) 
and the C1 set (0x80..0x9F) corresponds to the control functions defined by 
ISO/IEC 6429. That is also what the Unicode Standard implicitly assumes for 
U+0000..U+001F, U+007F..U+009F. So one-to-one conversion of the control codes 
is the correct thing to do. Even in the occasional cases where data using 
control function conventions other than ISO 6429 is converted, the control code 
values are preserved through conversion to Unicode this way.
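
As a rough illustration of the graphic/control split described above, here is a 
small Python 3 sketch using the standard unicodedata module. The helper name 
is_control is made up for this example. It checks that the code points below 
U+0100 that Unicode classifies as control characters (general category Cc) are 
exactly the ranges 0x00..0x1F and 0x7F..0x9F.

```python
import unicodedata

def is_control(code_point):
    # General category "Cc" is how Unicode marks control codes.
    return unicodedata.category(chr(code_point)) == "Cc"

# Collect every control code among the first 256 code points.
controls = [cp for cp in range(0x100) if is_control(cp)]

# The ranges ISO 8859-1 leaves outside its scope: 0x00..0x1F and 0x7F..0x9F.
expected = list(range(0x00, 0x20)) + list(range(0x7F, 0xA0))

print(controls == expected)
```

Everything else below U+0100 falls in the graphic ranges 0x20..0x7E and 
0xA0..0xFF.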

So, yes, Python is correct in converting all 256 values 0x00..0xFF in Latin-1 
data to U+0000..U+00FF in Unicode.
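
A quick way to see this behavior for yourself, assuming Python 3 and its 
built-in "latin-1" codec (the variable names are only for illustration):

```python
# All 256 possible byte values, in order.
data = bytes(range(256))

# Decode as Latin-1; each byte should become the code point of the same value.
text = data.decode("latin-1")
code_points = [ord(ch) for ch in text]

print(code_points == list(range(256)))

# The mapping is reversible: encoding gives back the original bytes.
print(text.encode("latin-1") == data)
```

Both comparisons should print True, control codes included.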

But no, this does *not* imply that the Unicode Standard has inserted character 
definitions into ISO/IEC 8859-1.

--Ken

Restated, are the first 256 characters of unicode intended to be exactly 
compatible with a latin1 codec?
This would imply that unicode has inserted character definitions into the 
ISO-8859-1 standard.
