On Tue, 14 Feb 2006 12:31:07 -0700, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
>On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote:
>> On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
>> > "\x80".encode('latin-1')
>>
>> But in 2.5 we can't change that to return a bytes object without
>> creating HUGE incompatibilities.
>
>People could spell it bytes(s.encode('latin-1')) in order to make it
>work in 2.X. That spelling would provide a way of ensuring the type
>of the return value.

UIAM, spelling it

    bytes(map(ord, s))

or

    bytes(s)                 # (bytes would do the above internally)

would work for str or unicode and would be forward compatible. Or:

    bytes(s, encoding_name)  # if the standard mapping is not desired

BTW, ord(u'x') gives the same value as the byte produced by
u'x'.encode('latin-1'). Note:

    >>> s256 = ''.join(chr(i) for i in xrange(256))
    >>> assert s256.decode('latin-1') == u''.join(unichr(ord(c)) for c in s256)
    >>> assert map(ord, s256.decode('latin-1')) == map(ord, s256) == range(256)

But this does *not* mean bytes has an implicit encoding!! It just means
there is a useful 1:1 mapping between the possible byte values and the
first 256 unicode *characters*, remembering that the latter are
*characters* quite apart from whatever encoding the source code may be
written in. This is a nice, safe 1:1 abstract correspondence, ISTM.

>> You missed the part where I said that introducing the bytes type
>> *without* a literal seems to be a good first step. A new type, even
>> built-in, is much less drastic than a new literal (which requires
>> lexer and parser support in addition to everything else).
>
>Are you concerned about the implementation effort? If so, I don't
>think that's justified, since adding a new string prefix should be
>pretty straightforward (relative to the rest of the effort involved).
>Are you comfortable with the proposed syntax?

I'm -1 on a special literal at this point. I think a special text-like
literal would be misleading, because it suggests that bytes is somehow
in the string family of types, which IMO it really isn't.
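As a sanity check, the 1:1 correspondence above can be verified directly
(written py3k-style, where bytes and unicode str exist as proposed):

```python
# Check the 1:1 mapping between byte values 0..255 and the first
# 256 unicode characters (py3k-style bytes/str assumed).
all_bytes = bytes(range(256))

# Decoding via latin-1 maps byte value i to unicode code point i ...
as_text = all_bytes.decode('latin-1')
assert [ord(c) for c in as_text] == list(range(256))

# ... and encoding maps each of those code points back, losslessly.
assert as_text.encode('latin-1') == all_bytes
```

No other codec has this property for the full 0..255 range, which is
what makes the mapping safe to treat as an abstract correspondence
rather than an implicit encoding.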
IMO it's semantically more of a builtin array.array('B'). If we adopt
the ord/unichr mappings for strings to/from bytes, and of course init
also from a suitable integer sequence, then YAGNI applies to a literal,
I think.

Using non-ascii, non-escaped characters in str literals to specify
ord values (as opposed to characters) is bad practice, but escaped
ascii-in-whatever-source-encoding and
native_literal_in_source_encoding.decode(source_encoding) seem to work:

    >>> for enc in 'cp437 latin-1 utf-8'.split():
    ...     print '\n====< %s >===='%enc
    ...     print mkretesc(enc, 0xf6)[1].decode(enc)
    ...     print repr(mkretesc(enc, 0xf6)[1])
    ...     print mkretesc(enc, 0xf6)[0]()
    ...     t = mkretesc(enc, 0xf6)[0]()
    ...     print t[0], t[1], t[2]
    ...     print
    ...
    ====< cp437 >====
    # -*- coding: cp437 -*-
    def foof6(): return '\xf6', 'ö', 'ö'.decode('cp437')

    "# -*- coding: cp437 -*-\ndef foof6(): return '\\xf6', '\x94', '\x94'.decode('cp437')\n"
    ('\xf6', '\x94', u'\xf6')
    ÷ ö ö

    ====< latin-1 >====
    # -*- coding: latin-1 -*-
    def foof6(): return '\xf6', 'ö', 'ö'.decode('latin-1')

    "# -*- coding: latin-1 -*-\ndef foof6(): return '\\xf6', '\xf6', '\xf6'.decode('latin-1')\n"
    ('\xf6', '\xf6', u'\xf6')
    ÷ ÷ ö

    ====< utf-8 >====
    # -*- coding: utf-8 -*-
    def foof6(): return '\xf6', 'ö', 'ö'.decode('utf-8')

    "# -*- coding: utf-8 -*-\ndef foof6(): return '\\xf6', '\xc3\xb6', '\xc3\xb6'.decode('utf-8')\n"
    ('\xf6', '\xc3\xb6', u'\xf6')
    ÷ +¦ ö

The source looks the same viewed as characters, but you can see the
differences in the repr values. The consequence of source-encoding ord
values determining str values is that if, e.g., you imported this foof6
function from variously encoded sources, only the escaped version and
the unicode version have the proper ord value; the middle one comes
from the native literal's source encoding. So until str becomes
unicode, ascii or ascii escapes are a must for ord-specifying.
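The same codec behaviour can be seen without generated source files:
the single byte 0xf6 names a different character under each of the
three codecs, and the character U+00F6 (umlaut o) gets a different byte
sequence under each. A minimal py3k-style sketch:

```python
# How the single byte value 0xf6 reads under each codec.
b = b'\xf6'
assert b.decode('latin-1') == '\xf6'   # U+00F6, umlaut o
assert b.decode('cp437')   == '\xf7'   # U+00F7, division sign

# 0xf6 on its own is not valid utf-8 at all (it would have to be
# the lead byte of a multi-byte sequence).
try:
    b.decode('utf-8')
except UnicodeDecodeError:
    pass

# And how the character U+00F6 is encoded by each codec.
assert '\xf6'.encode('latin-1') == b'\xf6'
assert '\xf6'.encode('cp437')   == b'\x94'
assert '\xf6'.encode('utf-8')   == b'\xc3\xb6'  # two bytes, not one
```

This is exactly the pattern visible in the repr values above: one
intended character, three different on-disk byte sequences depending
on the coding declaration.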
After str becomes unicode, escapes will still work, but the unichr/ord
symmetry will allow using the full first 256 unicode characters to
specify bytes-type values if desired. (This happens to correspond to
latin-1, but don't mention it ;-) It would make possible a
round-trippable repr as bytes('...') using ascii plus escaped ascii,
and full-256 unicode string literals backwards-compatibly after py3k.
Have I missed a pitfall?

Hope the output got through to your screen. The first and last
characters in the 3-character lines should always be the division sign
and umlaut o. The problematical middle ones should be the cp437
translations of the middle hex values, since that is the screen I
copied from (umlaut o, division sign, and plus + vertical bar for the
translation of the utf-8 encoding pair -- that one illustrates the
problem of returning a "character" encoded in utf-8 while thinking in
terms of a single-byte ord value).

BTW, should bytes be freezable?

Regards,
Bengt Richter
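P.S. The round-trippable-repr property can be sketched in py3k-style
terms (the exact spelling of the literal/constructor is whatever gets
adopted; the point is that the repr stays pure ascii and evaluates back
to an equal object):

```python
# A repr built from ascii plus escaped ascii round-trips for any
# byte content, including all 256 values at once.
b = bytes(range(256))
r = repr(b)
assert all(ord(ch) < 128 for ch in r)  # repr is pure ascii
assert eval(r) == b                    # and it evaluates back losslessly
```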
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com