On Tue, 14 Feb 2006 15:14:07 -0800, Guido van Rossum <[EMAIL PROTECTED]> wrote:
>On 2/14/06, M.-A. Lemburg <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>> > As Phillip guessed, I was indeed thinking about introducing bytes()
>> > sooner than that, perhaps even in 2.5 (though I don't want anything
>> > rushed).
>>
>> Hmm, that is probably going to be too early. As the thread shows
>> there are lots of things to take into account, esp. since if you
>> plan to introduce bytes() in 2.x, the upgrade path to 3.x would
>> have to be carefully planned. Otherwise, we end up introducing
>> a feature which is meant to prepare for 3.x and then we end up
>> causing breakage when the move is finally implemented.
>
>You make a good point. Someone probably needs to write up a new PEP
>summarizing this discussion (or rather, consolidating the agreement
>that is slowly emerging, where there is agreement, and summarizing the
>key open questions).
>
>> > Even in Py3k though, the encoding issue stands -- what if the file
>> > encoding is Unicode? Then using Latin-1 to encode bytes by default
>> > might not be what the user expected. Or what if the file encoding is
>> > something totally different? (Cyrillic, Greek, Japanese, Klingon.)
>> > Anything default but ASCII isn't going to work as expected. ASCII
>> > isn't going to work as expected either, but it will complain loudly
>> > (by throwing a UnicodeError) whenever you try it, rather than causing
>> > subtle bugs later.
>>
>> I think there's a misunderstanding here: in Py3k, all "string"
>> literals will be converted from the source code encoding to
>> Unicode. There are no ambiguities - a Klingon character will still
>> map to the same ordinal used to create the byte content regardless
>> of whether the source file is encoded in UTF-8, UTF-16 or
>> some Klingon charset (are there any?).
>
>OK, so a string (literal or otherwise) containing a Klingon character
>won't be acceptable to the bytes() constructor in 3.0. It shouldn't be
>in 2.x either then.
>
>I still think that someone who types a file in Latin-1 and enters
>non-ASCII Latin-1 characters in a string literal and then passes it to
>the bytes() constructor might expect to get bytes encoded in Latin-1,
>and someone who types a file in UTF-8 and enters non-ASCII Unicode
>characters might expect to get UTF-8-encoded bytes. Since they can't
>both get what they want, we should disallow both, and only allow
>ASCII.

ISTM this is a good rule for backwards compatibility for the
'...' => u'...' py3k transition. I don't know if you saw my other post,
but I was suggesting that bytes(s_or_u) should map each character to its
integer value by the current definition of ord, for either str or unicode.

UIAM this works when you convert ASCII, and it will keep working if you
convert the ASCII string to unicode. It will also let you use unicode
_currently_ to get past the ASCII restriction, since ord(u) yields a value
in range(256) for each of the first 256 unicode characters. Using those
characters in bytes(u'...') works even if your source encoding is utf-8
and contains ascii escapes, e.g.

 >>> utfsrc = """\
 ... # -*- coding: utf-8 -*-
 ... umlaut_os, values = u'\xf6\\xf6', map(ord, u'\xf6\\xf6')
 ... """.decode('latin-1').encode('utf-8')

Hopefully showing on your screen properly:

 >>> print utfsrc.decode('utf-8')
 # -*- coding: utf-8 -*-
 umlaut_os, values = u'ö\xf6', map(ord, u'ö\xf6')

And the repr, where you can see the two-byte utf-8 sequence for the
umlaut and the \\xf6 ascii escape:

 >>> print repr(utfsrc)
 "# -*- coding: utf-8 -*-\numlaut_os, values = u'\xc3\xb6\\xf6', map(ord, u'\xc3\xb6\\xf6')\n"

Compiling the utf-8 source and executing it:

 >>> exec compile(utfsrc,'','exec')

Good results:

 >>> umlaut_os, map(hex, values)
 (u'\xf6\xf6', ['0xf6', '0xf6'])
 >>> print umlaut_os
 öö

So map(ord, s_or_u) works predictably now, and will not break after py3k
unless you use non-ascii in _plain_ str strings now. But in unicode it
should be ok even now.
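A minimal sketch of the ord mapping described above, assuming a Python 3
style bytes type that accepts a list of ints (as in the
bytes([0x12, ...]) spelling). The helper names bytes_from_ord and
text_from_bytes are hypothetical, chosen only for illustration, and are
not a proposed API:

```python
def bytes_from_ord(text):
    """Map each character to one byte via ord(); hypothetical helper."""
    values = []
    for ch in text:
        value = ord(ch)
        if value > 255:
            # Characters beyond the first 256 code points have no
            # single-byte ord value, so complain loudly.
            raise ValueError(
                "character %r (ord %d) does not fit in a byte" % (ch, value)
            )
        values.append(value)
    return bytes(values)


def text_from_bytes(data):
    """Inverse mapping: one character per byte via chr(); hypothetical helper."""
    return "".join(chr(b) for b in data)


umlauts = "\xf6\xf6"
packed = bytes_from_ord(umlauts)

# The ord mapping happens to coincide with Latin-1 encoding.
assert packed == umlauts.encode("latin-1")
# Round trip through the inverse mapping gives back the same text.
assert text_from_bytes(packed) == umlauts
```

Under this sketch a character above code point 255 raises ValueError
rather than being silently encoded, which is one possible way to keep
the mapping independent of any source or default encoding.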
I think ord is a consistent and handy mapping of characters to bytes, and
the fact that it works for unicode for all 256 characters seems to me a
boon. (So long as no one gets upset that ord(u) _happens_ to match
ord(u.encode('latin-1')) ;-)
I haven't yet seen where you ruled against ord mapping of unicode to
bytes, so I am hopeful that you will consider it.

>> Furthermore, by restricting to ASCII you'd also rule out hex escapes
>> which seem to be the natural choice for presenting binary data in
>> literals - the Unicode representation would then only be an
>> implementation detail of the way Python treats "string" literals
>> and a user would certainly expect to find e.g. \x88 in the bytes object
>> if she writes bytes('\x88').
>
>I guess we'll just have to disappoint her. Too bad for the person who
>wrote bytes("\x12\x34\x56\x78\x9a\xbc\xde\xf0") -- they'll have to
>write bytes([0x12,0x34,0x56,0x78,0x9a,0xbc,0xde,0xf0]). Not so bad IMO
>and certainly easier than a *mixture* of hex and ASCII like
>'\xabc\xdef'.
>
>> But maybe you have something different in mind... I'm talking
>> about ways to create bytes() in Py3k using "string" literals.
>
>I'm not sure that's going to be common practice except for ASCII
>characters used in network protocols.
>
>> >> While we're at it: I'd suggest that we remove the auto-conversion
>> >> from bytes to Unicode in Py3k and the default encoding along with
>> >> it.
>> >
>> > I'm not sure which auto-conversion you're talking about, since there
>> > is no bytes type yet. If you're talking about the auto-conversion from
>> > str to unicode: the bytes type should not be assumed to have *any*
>> > properties that the current str type has, and that includes
>> > auto-conversion.
>>
>> I was talking about the automatic conversion of 8-bit strings to
>> Unicode - which was a key feature to make the introduction of
>> Unicode less painful, but will no longer be necessary in Py3k.
>
>OK. The bytes type certainly won't have this property.
>
Yes, ISTM bytes is an array of unsigned 8-bit numbers, and allowing handy
initializations by passing str or unicode to the constructor is no
different from allowing

 >>> import array
 >>> array.array('B', map(ord, u'some chars \xf6\x00\x01\x02\xff'))
 array('B', [115, 111, 109, 101, 32, 99, 104, 97, 114, 115, 32, 246, 0, 1, 2, 255])

Letting bytes accept the string with an internal ord mapping seems a
handy, concise initialization _option_, though passing the int list is
fine too.

 bytes(u'some chars \xf6\x00\x01\x02\xff')

BTW, that could serve as a round-trippable repr, but

 bytes([115, 111, 109, 101, 32, 99, 104, 97, 114, 115, 32, 246, 0, 1, 2, 255])

is truer to the nature of bytes as an int array. I don't know.

It seems that, for symmetry with bytes(s_str),

 str(bytes(s_str))

should return s_str now and unicode later, and

 unicode(bytes(u'(restricted to first 256)'))

should return the identical unicode. That all falls out for free, I think.

ISTM the ord/unichr symmetry allows us to ignore most of the encoding
issues (except currently restricting to plain ascii in plain str literals
that must retain their ord meaning as unicode after py3k turns them into
unicode).

BTW, I posted one (I think) other post with this essential idea, and there
are a lot of posts I am tempted to respond to, but I will restrain
myself ;-)

Regards,
Bengt Richter