Re: latin1 and cp1252 inconsistent?

buck Fri, 16 Nov 2012 15:33:00 -0800

On Friday, November 16, 2012 2:34:32 PM UTC-8, Ian wrote:
> On Fri, Nov 16, 2012 at 2:44 PM,  <buck> wrote:
> 
> > Latin1 has a block of 32 undefined characters.
> 
> 
> These characters are not undefined.  0x80-0x9f are the C1 control
> codes in Latin-1, much as 0x00-0x1f are the C0 control codes, and
> their Unicode mappings are well defined.


They are indeed undefined: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf

""" The shaded positions in the code table correspond
    to bit combinations that do not represent graphic
    characters. Their use is outside the scope of
    ISO/IEC 8859; it is specified in other International
    Standards, for example ISO/IEC 6429. """


However, it's reasonable for 0x81 to decode to U+0081, because the Unicode
standard says: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

""" The semantics of the control codes are generally determined by the
application with which they are used. However, in the absence of specific
application uses, they may be interpreted according to the control function
semantics specified in ISO/IEC 6429:1992. """
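
To make the difference concrete, here is a small sketch (Python 3 assumed;
the variable names are mine, purely for illustration). It decodes every byte
in 0x80-0x9f with latin1, then collects the bytes in that range that Python's
cp1252 codec refuses to decode:

```python
# Bytes 0x80-0x9f: latin1 maps each one to the code point with the same number.
c1_bytes = bytes(range(0x80, 0xA0))
latin1_text = c1_bytes.decode('latin1')
latin1_codepoints = [ord(ch) for ch in latin1_text]

# latin1 is fully reversible for these bytes.
latin1_roundtrip = latin1_text.encode('latin1')

# Collect the bytes in the same range that cp1252 leaves unassigned.
cp1252_undefined = []
for b in range(0x80, 0xA0):
    try:
        bytes([b]).decode('cp1252')
    except UnicodeDecodeError:
        cp1252_undefined.append(b)

print([hex(b) for b in cp1252_undefined])
```

So latin1 never fails on this range, while cp1252 in strict mode rejects a
handful of bytes, 0x81 among them.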


> You can use a non-strict error handling scheme to prevent the error.
> >>> b'hello \x81 world'.decode('cp1252', 'replace')
> 'hello \ufffd world'

This makes the decoding non-reversible and loses data, which isn't
acceptable for my application.
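
For comparison, one error handler that might keep the round trip lossless is
'surrogateescape' (Python 3.1+). This is only a sketch of that idea, not
something proposed earlier in the thread, and the variable names are
hypothetical. The undecodable byte is smuggled through as a lone surrogate
and restored on encode:

```python
raw = b'hello \x81 world'

# 'replace' substitutes U+FFFD, so the original byte cannot be recovered.
lossy = raw.decode('cp1252', 'replace')

# 'surrogateescape' maps the undecodable byte 0x81 to the lone surrogate
# U+DC81, which the same handler turns back into 0x81 when encoding.
text = raw.decode('cp1252', 'surrogateescape')
restored = text.encode('cp1252', 'surrogateescape')

print(restored == raw)
```

The catch is that the resulting str contains a lone surrogate, so it cannot
be encoded to UTF-8 or written out with a strict handler without further
care.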
-- 
http://mail.python.org/mailman/listinfo/python-list
