On Tue, 14 Feb 2006 12:31:07 -0700, Neil Schemenauer <[EMAIL PROTECTED]> wrote:

>On Mon, Feb 13, 2006 at 08:07:49PM -0800, Guido van Rossum wrote:
>> On 2/13/06, Neil Schemenauer <[EMAIL PROTECTED]> wrote:
>> >     "\x80".encode('latin-1')
>> 
>> But in 2.5 we can't change that to return a bytes object without
>> creating HUGE incompatibilities.
>
>People could spell it bytes(s.encode('latin-1')) in order to make it
>work in 2.X.  That spelling would provide a way of ensuring the type
>of the return value.
Unless I'm mistaken, spelling it

    bytes(map(ord, s))
or
    bytes(s)  # (bytes would do the above internally)

would work for str or unicode and would be forward compatible, as would

    bytes(s, encoding_name)  # if the standard mapping is not desired
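
For concreteness, a minimal 2.x sketch of the constructor semantics I mean (the class name and details are illustrative only, built on array.array('B'); this is not a proposed implementation):

    import array

    class bytes_(object):  # hypothetical stand-in for the proposed bytes type
        def __init__(self, initializer, encoding=None):
            if isinstance(initializer, unicode):
                if encoding is None:
                    # default 1:1 mapping: code point -> byte value (must be < 256)
                    values = map(ord, initializer)
                else:
                    values = map(ord, initializer.encode(encoding))
            elif isinstance(initializer, str):
                # bytes(s) would do map(ord, s) internally
                values = map(ord, initializer)
            else:
                # any iterable of ints in range(256)
                values = list(initializer)
            self._data = array.array('B', values)
        def __repr__(self):
            return 'bytes(%r)' % self._data.tostring()

So bytes_('\x80'), bytes_(u'\x80'), and bytes_([0x80]) would all hold the single value 128, while bytes_(u'\xf6', 'utf-8') would hold the two values 0xc3, 0xb6.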

BTW, for code points below 256, ord(u'x') gives the same numeric value as the single byte produced by u'x'.encode('latin-1').
Note:
 >>> s256 = ''.join(chr(i) for i in xrange(256))
 >>> assert s256.decode('latin-1') == u''.join(unichr(ord(c)) for c in s256)
 >>> assert map(ord, s256.decode('latin-1')) == map(ord, s256) == range(256)

But this does *not* mean bytes has an implicit encoding!! It just means
there is a useful 1:1 mapping between the possible byte values and the
first 256 unicode *characters*, remembering that the latter are *characters*,
quite apart from whatever encoding the source code may have.

This is a nice safe 1:1 abstract correspondence ISTM.
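
Continuing the example above, the mapping also round-trips in the other direction (plain 2.x behavior, shown just to make the symmetry explicit):

 >>> assert s256.decode('latin-1').encode('latin-1') == s256
 >>> assert ''.join(chr(ord(u)) for u in s256.decode('latin-1')) == s256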
>
>> You missed the part where I said that introducing the bytes type
>> *without* a literal seems to be a good first step. A new type, even
>> built-in, is much less drastic than a new literal (which requires
>> lexer and parser support in addition to everything else).
>
>Are you concerned about the implementation effort?  If so, I don't
>think that's justified since adding a new string prefix should be
>pretty straightforward (relative to rest of the effort involved).
>Are you comfortable with the proposed syntax?
>

I'm -1 on a special literal at this point. I think a special text-like literal
would be misleading, because it suggests that bytes is somehow in the
string family of types, which IMO it really isn't.
Semantically it's more like a built-in array.array('B').

If we adopt the ord/unichr mappings for converting strings to/from bytes, and
of course also allow init from a suitable integer sequence, then I think we
ain't gonna need a literal (YAGNI).
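
To make the array analogy concrete in current 2.x terms (this is plain array.array behavior, not the proposed bytes API):

 >>> import array
 >>> a = array.array('B', map(ord, 'abc'))   # init from a str via ord values
 >>> a
 array('B', [97, 98, 99])
 >>> a[0]                                    # indexing yields small ints, not characters
 97
 >>> array.array('B', [0x80, 0xff])          # init from an integer sequence
 array('B', [128, 255])
 >>> a.tostring()                            # and back to a str if needed
 'abc'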

Using non-escaped non-ASCII characters in string literals to specify str ord
values (as opposed to characters) is bad practice, but escaped ASCII in
whatever source encoding, and
native_literal_in_source_encoding.decode(source_encoding),
both seem to work:

 >>> for enc in 'cp437 latin-1 utf-8'.split():
 ...     print '\n====< %s >===='%enc
 ...     print mkretesc(enc, 0xf6)[1].decode(enc)
 ...     print repr(mkretesc(enc, 0xf6)[1])
 ...     print mkretesc(enc, 0xf6)[0]()
 ...     t = mkretesc(enc, 0xf6)[0]()
 ...     print t[0], t[1], t[2]
 ...     print
 ...
 
 ====< cp437 >====
 # -*- coding: cp437 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('cp437')
 
 "# -*- coding: cp437 -*-\ndef foof6(): return '\\xf6', '\x94', 
'\x94'.decode('cp437')\n"
 ('\xf6', '\x94', u'\xf6')
 ÷ ö ö
 
 
 ====< latin-1 >====
 # -*- coding: latin-1 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('latin-1')
 
 "# -*- coding: latin-1 -*-\ndef foof6(): return '\\xf6', '\xf6', 
'\xf6'.decode('latin-1')\n"
 ('\xf6', '\xf6', u'\xf6')
 ÷ ÷ ö
 
 
 ====< utf-8 >====
 # -*- coding: utf-8 -*-
 def foof6(): return '\xf6', 'ö', 'ö'.decode('utf-8')
 
 "# -*- coding: utf-8 -*-\ndef foof6(): return '\\xf6', '\xc3\xb6', 
'\xc3\xb6'.decode('utf-8')\n"
 
 ('\xf6', '\xc3\xb6', u'\xf6')
 ÷ +¦ ö
 
The source looks the same viewed as characters, but you can see the differences
in the repr values. The consequence of the source encoding determining the ord
values of str literals is that if you imported this foo function from variously
encoded sources, only the escaped and the unicode versions would have the
intended ord value; the middle one comes from the native literal's source
encoding.
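
(mkretesc isn't shown here; for the record, a helper along the following lines reproduces the output above. It is a reconstruction, not necessarily the original code: build a tiny source with a coding cookie, compile and exec it, and return the function plus the source.)

    def mkretesc(enc, o):
        # build a tiny module source whose third literal is the character
        # written natively in the target source encoding
        ch = unichr(o).encode(enc)
        src = ("# -*- coding: %s -*-\n"
               "def foo%x(): return '\\x%02x', '%s', '%s'.decode('%s')\n"
               % (enc, o, o, ch, ch, enc))
        ns = {}
        exec compile(src, '<mkretesc>', 'exec') in ns
        return ns['foo%x' % o], src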

So until str becomes unicode, ASCII or ASCII escapes are a must for specifying
ord values. After str becomes unicode, escapes will still work, but the
unichr/ord symmetry will allow using the full first 256 unicode characters to
specify bytes values if desired. (This happens to correspond to latin-1,
but don't mention it ;-)

It would make possible a round-trippable repr as bytes('...'), using ASCII plus
escaped ASCII now, and the full first 256 unicode characters in string literals
backwards-compatibly after py3k.
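
A minimal sketch of such a repr helper in today's terms (the name bytes_repr is made up for illustration; assume the data arrives as a str of raw byte values):

    def bytes_repr(data):
        # printable ASCII stays as-is; quote, backslash, and everything
        # non-printable or non-ASCII gets hex-escaped, so the result could
        # round-trip through eval once a bytes('...') constructor exists
        out = []
        for c in data:
            o = ord(c)
            if 32 <= o < 127 and c not in "\\'":
                out.append(c)
            else:
                out.append('\\x%02x' % o)
        return "bytes('%s')" % ''.join(out)

E.g. bytes_repr('ab\x80') gives "bytes('ab\\x80')".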
Have I missed a pitfall? I hope the output got through to your screen intact.
The first and last characters in the 3-character lines should always be the
division sign and umlaut o. The problematic middle ones should be the cp437
translations of the middle hex values, since that is the screen I copied from:
umlaut o, division sign, and plus followed by a broken vertical bar for the two
bytes of the utf-8 encoding pair. That last one illustrates the problem of
returning a "character" encoded in utf-8 while thinking in terms of a
single-byte ord value.

BTW, should bytes be freezable?

Regards,
Bengt Richter
