Re: [Python-Dev] bytes type discussion

Bengt Richter Wed, 15 Feb 2006 11:37:47 -0800

On Tue, 14 Feb 2006 15:13:25 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote:


>I'm about to send 6 or 8 replies to various salient messages in the
>PEP 332 revival thread. That's probably a sign that there's still a
>lot to be sorted out. In the meantime, to save you reading through
>all those responses, here's a summary of where I believe I stand.
>Let's continue the discussion in this new thread unless there are
>specific hairs to be split in the other thread that aren't addressed
>below or by later posts.
>
>Non-controversial (or almost):
>
>- we need a new PEP; PEP 332 won't cut it
>
>- no b"..." literal
>
>- bytes objects are mutable
>
>- bytes objects are composed of ints in range(256)
>
>- you can pass any iterable of ints to the bytes constructor, as long
>as they are in range(256)
>
>- longs or anything with an __index__ method should do, too
>
>- when you index a bytes object, you get a plain int
>
>- repr(bytes([10, 20, 30])) == 'bytes([10, 20, 30])'
>
>Somewhat controversial:
>
>- it's probably too big to attempt to rush this into 2.5
>
>- bytes("abc") == bytes(map(ord, "abc"))
>
>- bytes("\x80\xff") == bytes(map(ord, "\x80\xff")) == bytes([128, 255])
>
>Very controversial:
>
Given that ord/unichr and ord/chr work as encoding-agnostic function pairs
symmetrically mapping between unicode and int or str and int, please consider
the effect of this API as illustrated by how it works with the examples:

 >>> def bytes(arg, encoding=None):
 ...     if isinstance(arg, str):
 ...         if encoding: b = map(ord, arg.decode(encoding))
 ...         else: b = map(ord, arg)
 ...     elif isinstance(arg, unicode):
 ...         if encoding: raise ValueError(
 ...             'Use bytes(%r.encode(%r)) to avoid PY 3000 breakage'%(arg, encoding))
 ...         b = map(ord, arg)
 ...     else:
 ...         b = map(int, arg)
 ...     if sum(1 for x in b if x<0 or x>255) > 0:
 ...         raise ValueError('byte out of range')
 ...     return 'bytes(%r)'%b
 ...

 
Then

>- bytes("abc", "encoding") == bytes("abc") # ignores the "encoding" argument
(Use the encoding; the only requirement is that all the resulting ord values
be in range(0,256).)
 >>> bytes("abc\xf6", 'latin-1')
 'bytes([97, 98, 99, 246])'
 >>> print unichr(246)
 ö
 >>> bytes("abc\xf6", 'cp437')
 'bytes([97, 98, 99, 247])'
 >>> print unichr(247)
 ÷

>
>- bytes(u"abc") == bytes("abc") # for ASCII at least
 >>> bytes(u"abc")
 'bytes([97, 98, 99])'

>
>- bytes(u"\x80\xff") raises UnicodeError
 >>> bytes(u"\x80\xff")
 'bytes([128, 255])'

>
>- bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
 >>> bytes(u"\x80\xff", "latin-1")
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 6, in bytes
 ValueError: Use bytes(u'\x80\xff'.encode('latin-1')) to avoid PY 3000 breakage
 >>> bytes(u'\x80\xff'.encode('latin-1'))
 'bytes([128, 255])'

(If the characters exist in the encoding specified, it will work; otherwise it
raises an exception. This assumes a PY 3000 string encode results in bytes, so
it should work there too ;-)
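
A minimal sketch of that encode-first idiom, outside the toy bytes() above.
The helper name encoded_values is hypothetical, and bytearray is used only as
an assumption-free way to read the encoded result as ints on both 2.6+ and 3.x:

```python
def encoded_values(text, encoding):
    """Encode text first, then return the resulting byte values as ints."""
    # Iterating a bytearray yields ints on both Python 2.6+ and 3.x.
    return list(bytearray(text.encode(encoding)))

print(encoded_values(u"\x80\xff", "latin-1"))   # [128, 255]

try:
    encoded_values(u"\u20ac", "latin-1")        # EURO SIGN is not in latin-1
except UnicodeEncodeError as exc:
    print("cannot encode: %s" % exc.reason)
```

The failure here comes from the codec, before any range check is needed.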

Of course,
 >>> bytes(u'\u1234')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 12, in bytes
 ValueError: byte out of range
and
 >>> bytes([1,2])
 'bytes([1, 2])'
 >>> bytes([1,-1])
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 12, in bytes
 ValueError: byte out of range
 >>> bytes([1,256])
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
   File "<stdin>", line 12, in bytes
 ValueError: byte out of range

Interestingly, the internal map(int, arg) over a sequence permits
 >>> bytes(["1", 2, 3L, True, 5.6])
 'bytes([1, 2, 3, 1, 5])'

IOW, any sequence of objects that will convert themselves
to int in range(0,256) will do.
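
For contrast, a sketch of a stricter conversion along the lines of the
__index__ point in the summary above. operator.index accepts ints, longs,
bools and anything with an __index__ method, but refuses strings and floats.
The helper name strict_byte_values is made up for illustration:

```python
import operator

def strict_byte_values(iterable):
    """Return a list of ints in range(256), refusing lossy conversions."""
    values = []
    for item in iterable:
        # operator.index raises TypeError for "1" or 5.6; int() normalizes
        # int subclasses such as bool to a plain int.
        value = int(operator.index(item))
        if not 0 <= value <= 255:
            raise ValueError('byte out of range: %r' % (value,))
        values.append(value)
    return values

print(strict_byte_values([1, 2, True]))   # [1, 2, 1]
for bad in (["1"], [5.6]):
    try:
        strict_byte_values(bad)
    except TypeError:
        print("rejected %r" % (bad,))
```

Whether the permissive or the strict behavior is wanted is exactly the kind
of thing the new PEP would have to decide.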

>
>Martin von Loewis's alternative for the "very controversial" set is to
>disallow an encoding argument and (I believe) also to disallow Unicode
>arguments. In 3.0 this would leave us with s.encode(<encoding>) as the
>only way to convert a string (which is always unicode) to bytes. The
>problem with this is that there's no code that works in both 2.x and
>3.0.
>
I hope Martin will reconsider, considering ord/unichr as a symmetric
pair of functions mapping 1:1 to unicode (and ignoring the fact that
this also happens to be the latin-1 mapping ;-)
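
The coincidence is easy to check. The sketch below assumes Python 3 spelling
(chr where 2.x has unichr) and is only an illustration, not part of the
proposed API:

```python
# Python 3 spelling: chr() plays the role that unichr() has in 2.x.
code_points = list(range(256))
text = "".join(chr(n) for n in code_points)

# ord() undoes chr() for every value a byte can hold.
assert [ord(ch) for ch in text] == code_points

# For these 256 characters, latin-1 encoding yields the same numbers.
assert list(text.encode("latin-1")) == code_points
print("ord/chr round trip matches latin-1 for range(256)")
```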

A test class should be easy to write, except for deciding on appropriate
methods and how the type should be defined. It's the same peculiar problem
as with str, i.e., a length-one value would be compatible with int, but other
lengths would not. How do we do that?
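
As a starting point, here is one possible shape for such a test class,
covering only the non-controversial points (mutable, ints in range(256),
indexing returns a plain int, the agreed repr). The name SketchBytes and the
choice to wrap a list are assumptions made for this sketch; string arguments
and the length-one question are deliberately left out:

```python
import operator

class SketchBytes(object):
    """Toy mutable byte sequence; a sketch, not a proposed implementation."""

    def __init__(self, iterable=()):
        self._data = [self._check(item) for item in iterable]

    @staticmethod
    def _check(item):
        # Accepts int, long, or anything with __index__; rejects str/float.
        value = int(operator.index(item))
        if not 0 <= value <= 255:
            raise ValueError('byte out of range')
        return value

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Plain int for an integer index (slices are not handled specially).
        return self._data[i]

    def __setitem__(self, i, item):
        # Mutable, but every stored value is still validated.
        self._data[i] = self._check(item)

    def __eq__(self, other):
        return isinstance(other, SketchBytes) and self._data == other._data

    def __repr__(self):
        return 'bytes(%r)' % (self._data,)

b = SketchBytes([10, 20, 30])
b[0] = 255
print(repr(b))   # bytes([255, 20, 30])
```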

Regards,
Bengt Richter
