Re: latin1 decoder implementation

Buck Golemon Fri, 16 Nov 2012 14:05:53 -0800

When decoding bytes to Unicode with the "latin1" codec, there are three
options for handling bytes that the ISO-8859-1 standard does not define:

1) Throw an error.
2) Insert the replacement character (U+FFFD), indicating an unknown character.
3) Insert the Unicode character whose code point equals the byte value. This
means that completely random bytes will always decode successfully.
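
To make the three options concrete, here is a minimal sketch in Python 3.
The helper decode_iso8859_1 and its "passthrough" mode are hypothetical
names of mine, not part of Python's codecs. I am also assuming that the
undefined positions are 0x00-0x1F and 0x7F-0x9F, the ranges the ISO 8859-1
code table leaves unassigned; note that this makes even a newline an
"undefined" byte.

```python
# Hypothetical helper for illustration; not part of Python's codecs module.
# Assumption: ISO 8859-1 assigns only 0x20-0x7E and 0xA0-0xFF.
UNDEFINED = set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))


def decode_iso8859_1(data, errors="strict"):
    """Decode bytes, handling undefined positions per `errors`.

    errors="strict"      -> option 1: raise UnicodeDecodeError
    errors="replace"     -> option 2: insert U+FFFD
    errors="passthrough" -> option 3: use the code point equal to the byte
    """
    if errors not in ("strict", "replace", "passthrough"):
        raise ValueError("unknown errors mode: %r" % (errors,))
    out = []
    for pos, byte in enumerate(data):
        if byte not in UNDEFINED or errors == "passthrough":
            out.append(chr(byte))
        elif errors == "replace":
            out.append("\ufffd")
        else:
            raise UnicodeDecodeError(
                "iso8859-1", bytes(data), pos, pos + 1,
                "byte not defined in ISO 8859-1",
            )
    return "".join(out)
```

With this sketch, decode_iso8859_1(b"a\x85b", errors="replace") gives
"a\ufffdb", while the "passthrough" mode gives "a\x85b", which is what
option three amounts to.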

Python currently implements option three. Is this correct?
The decoder's errors argument can produce errors or replacements for
encodings that have undefined bytes, but as implemented, latin1 defines
a character for every one of the 256 byte values, so the argument has no
effect.
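
This is easy to check from the interpreter. The snippet below is only a
quick demonstration, written for Python 3; the variable names are mine.
It decodes every possible byte value with two different error handlers
and compares the results, then contrasts that with the "ascii" codec,
where the handler does make a difference.

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# The error handler is never consulted: latin1 maps all 256 bytes.
strict = data.decode("latin1", errors="strict")
replaced = data.decode("latin1", errors="replace")
same_result = strict == replaced

# Each byte decodes to the code point with the same numeric value.
code_points = [ord(ch) for ch in strict]
is_identity = code_points == list(range(256))

# Contrast: "ascii" has undefined bytes, so the handler matters there.
ascii_replaced = b"\x85".decode("ascii", errors="replace")

print(same_result, is_identity, ascii_replaced == "\ufffd")
```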

Restated: are the first 256 code points of Unicode intended to be exactly
compatible with a latin1 codec?
That would imply that Unicode has inserted character definitions into the
ISO-8859-1 standard.

