Re: [Python-Dev] What to do for bytes in 2.6?

glyph Sat, 19 Jan 2008 23:50:08 -0800

On 04:26 am, [EMAIL PROTECTED] wrote:

On Jan 19, 2008 5:54 PM,  <[EMAIL PROTECTED]> wrote:

On 19 Jan, 07:32 pm, [EMAIL PROTECTED] wrote:

Starting with the most relevant bit before getting off into digressions that may not interest most people:

Why can't we get that warning in -3 mode just the same from something
read from a socket and a b"" literal?

If you really want this, please think through all the consequences,
and report back here. While I have a hunch that it'll end up giving
too many false positives and at the same time too many false
negatives, perhaps I haven't thought it through enough. But if you
really think this'll be important for you, I hope you'll be willing to
do at least some of the thinking.

While I stand by my statement that unicode is the Right Way to do text in Python, this particular feature isn't really that important, and I can see there are cases where it might cause problems or make life more difficult. I suspect that I won't really know whether I want the warning anyway before I've actually tried to port any nuanced, real text-processing code to 3.0, and it looks like it's going to be a little while before that happens. I suspect that if I do want the warning, it would be a feature for 2.7, not 2.6, so I don't want to waste a lot of everyone's time advocating for it.

Now for a nearly irrelevant digression (please feel free to stop reading here):

Now, ad-hoc code with a fast and loose definition of "text" can still
read arrays of bytes off a socket without specifying an encoding and get
away with it, but that's because Python's unicode implementation has
thus far been very forgiving, not because the data is cleanly text yet.
I would say that depends on the application, and on arrangements that
client and server may have made off-line about the encoding.

I can see your point. I think it probably holds better on files and streams than on sockets, though - please forgive me if I don't think that server applications which require environment-dependent out-of-band arrangements about locale are correct :).
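To make the out-of-band encoding point concrete, here is a minimal sketch written for Python 3 (where str is text and bytes is binary). The payload and both encodings are assumptions chosen for illustration; they do not come from any real protocol. The same bytes decode without error under two different encodings, but only one result is the text the sender meant.

```python
# Illustrative only: the payload and both encodings are assumptions.
# These are bytes as they might arrive from a socket.
payload = "caf\u00e9".encode("utf-8")

# Decoding with the encoding the sender actually used recovers the text.
as_utf8 = payload.decode("utf-8")

# A wrong guess does not raise an error; it silently yields different text.
as_latin1 = payload.decode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
print(len(as_utf8), len(as_latin1))
```

Nothing in the bytes themselves says which decoding is right, which is why the arrangement has to live somewhere: in the protocol, or out of band.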

In 2.x, text can legitimately be represented as str -- there's even
the locale module to further specify how it is to be interpreted as
characters.

I'm aware that this specific example is kind of a ridiculous stretch, but it's the first one that came to mind. Consider len(u'�'.encode('utf-8').rjust(5).decode('utf-8')). Of course unicode.rjust() won't do the right thing in the case of surrogate pairs, not to mention RTL text, but it still handles a lot more cases than str.rjust(), since code points behave a lot more like characters than code units do.
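A rough version of that example, rewritten for Python 3 so it runs today. The sample character is an assumption: the archived message lost the original, so U+00E9 stands in for it. In 2.x terms, the bytes path corresponds to str.rjust() and the text path to unicode.rjust().

```python
# Assumed sample character (U+00E9); the original was lost in the archive.
char = "\u00e9"

# Padding the encoded form counts code units: UTF-8 uses 2 bytes for this
# character, so only 3 spaces are added to reach a width of 5 bytes.
via_bytes = char.encode("utf-8").rjust(5).decode("utf-8")

# Padding the text form counts code points: 4 spaces are added.
via_text = char.rjust(5)

print(len(via_bytes))  # 4
print(len(via_text))   # 5
```

The byte-level padding comes out one column short because the two-byte character was counted twice.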

Sure, this doesn't work for full unicode, and it doesn't work for all
protocols used with sockets, but claiming that only fast and loose
code ever uses str to represent text is quite far from reality -- this
would be saying that the locale module is only for quick and dirty
code, which just ain't so.

It would definitely be overreaching to say all code that uses str is quick and dirty. But I do think that it fits into one of two categories: quick and dirty, or legacy. locale is an example of a legacy case for which there is no replacement (that I'm aware of). Even if I were writing a totally unicode-clean application, as far as I'm aware, there's no common replacement for e.g. locale.currency().

Still, locale is limiting. It's ... uncomfortable to call locale.currency() in a multi-user server process. It would be nice if there were a replacement that completely separated encoding issues from localization issues.

I believe that a constraint should be that by default (without -3 or a
__future__ import) str and bytes should be the same thing. Or, another
way of looking at this, reads from binary files and reads from sockets
(and other similar things, like ctypes and mmap and the struct module,
for example) should return str instances, not instances of a str
subclass by default -- IMO returning a subclass is bound to break too
much code. (Remember that there is still *lots* of code out there that
uses "type(x) is types.StringType" rather than "isinstance(x, str)",
and while I'd be happy to warn about that in -3 mode if we could, I
think it's unacceptable to break that in the default environment --
let it break in 3.0 instead.)
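A small sketch of why returning a subclass would break the exact-type idiom. It is written for Python 3, where types.StringType no longer exists, so "type(x) is str" stands in for the 2.x check. TaggedStr is a hypothetical subclass invented for this illustration; it is not a proposed or real type.

```python
class TaggedStr(str):
    """Hypothetical str subclass, standing in for a 'bytes-flavoured' str."""


# Imagine this value came back from a socket or binary file read.
value = TaggedStr("data")

# The exact-type idiom (types.StringType in 2.x) rejects the subclass.
exact_check = type(value) is str

# isinstance() accepts it, because it honours inheritance.
isinstance_check = isinstance(value, str)

print(exact_check)       # False
print(isinstance_check)  # True
```

Code using the first idiom would start taking its "not a string" branch as soon as reads returned the subclass, even though the value still behaves like a str.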

I agree. But, it's precisely because this is so subtle that it would be nice to have tools which would report warnings to help fix it. *Certainly* by default, everywhere that's "str" in 2.5 should be "str" in 2.6. Probably even in -3 mode, if the goal there is "warnings only". However, the feature still strikes me as potentially useful while porting. If I were going to advocate for it, though, it would be as a separate option, e.g. "--separate-bytes-type". I say this as separate from just trying to run the code on 3.0 to see what happens because it seems like the most subtle and difficult aspect of the port to get right; it would be nice to be able to tweak it individually, without the other issues related to 3.0. For example, some of the code I work on has a big stack of dependencies. Some of those are in C, most of them don't process text at all. However, most of them aren't going to port to 3.0 very early, but it would be good to start running in as 3.0-like of an environment as possible earlier than that so that the hard stuff is done by the time the full stack has been migrated.

I've written lots of code that
aggressively rejects str() instances as text, as well as unicode
instances as bytes, and that's in code that still supports 2.3 ;).
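One way such strict checks might look, as a sketch only. It uses Python 3 type names; in 2.x the text type would be unicode and the byte type str. The helper names require_text and require_bytes are hypothetical and are not taken from any existing library.

```python
def require_text(value):
    """Return value unchanged if it is text; otherwise raise TypeError."""
    if not isinstance(value, str):
        raise TypeError("expected text, got %s" % type(value).__name__)
    return value


def require_bytes(value):
    """Return value unchanged if it is bytes; otherwise raise TypeError."""
    if not isinstance(value, bytes):
        raise TypeError("expected bytes, got %s" % type(value).__name__)
    return value


# Each helper accepts its own type and rejects the other one.
print(require_text("hello"))
print(require_bytes(b"hello"))
```

Calling these at API boundaries makes a text/bytes mix-up fail at the point of entry instead of surfacing later as a decoding error.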


Yeah, well, but remember, while keeping you happy is high on my list


Thanks, good to hear :)

of priorities, it's not the only priority. :-)

I don't think it's even my fiancée's *only* priority, and I think it should stay higher on her list than yours ;-).
