On 04:26 am, [EMAIL PROTECTED] wrote:
On Jan 19, 2008 5:54 PM, <[EMAIL PROTECTED]> wrote:
On 19 Jan, 07:32 pm, [EMAIL PROTECTED] wrote:
Starting with the most relevant bit before getting off into digressions
that may not interest most people:
Why can't we get that warning in -3 mode just the same from something
read from a socket and a b"" literal?
If you really want this, please think through all the consequences,
and report back here. While I have a hunch that it'll end up giving
too many false positives and at the same time too many false
negatives, perhaps I haven't thought it through enough. But if you
really think this'll be important for you, I hope you'll be willing to
do at least some of the thinking.
While I stand by my statement that unicode is the Right Way to do text
in python, this particular feature isn't really that important, and I
can see there are cases where it might cause problems or make life more
difficult. I suspect that I won't really know whether I want the
warning anyway before I've actually tried to port any nuanced, real
text-processing code to 3.0, and it looks like it's going to be a little
while before that happens. I suspect that if I do want the warning, it
would be a feature for 2.7, not 2.6, so I don't want to waste a lot of
everyone's time advocating for it.
Now for a nearly irrelevant digression (please feel free to stop reading
here):
Now, ad-hoc code with a fast and loose definition of "text" can still
read arrays of bytes off a socket without specifying an encoding and get
away with it, but that's because Python's unicode implementation has
thus far been very forgiving, not because the data is cleanly text yet.
I would say that depends on the application, and on arrangements that
client and server may have made off-line about the encoding.
I can see your point. I think it probably holds better on files and
streams than on sockets, though - please forgive me if I don't think
that server applications which require environment-dependent out-of-band
arrangements about locale are correct :).
In 2.x, text can legitimately be represented as str -- there's even
the locale module to further specify how it is to be interpreted as
characters.
I'm aware that this specific example is kind of a ridiculous stretch,
but it's the first one that came to mind. Consider
len(u'é'.encode('utf-8').rjust(5).decode('utf-8')). Of course
unicode.rjust() won't do the right thing in the case of surrogate pairs,
not to mention RTL text, but it still handles a lot more cases than
str.rjust(), since code points behave a lot more like characters than
code units do.
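
To make the rjust() example concrete, here is a minimal sketch. It is written
in Python 3 spelling so it runs as-is, which is an assumption on my part: bytes
stands in for 2.x str, and str stands in for 2.x unicode. The variable names
are only for illustration.

```python
# Python 3 spelling of the example: bytes plays the role of 2.x str,
# and str plays the role of 2.x unicode.
text = "\u00e9"                  # 'é' as a single precomposed code point
encoded = text.encode("utf-8")   # two code units (bytes) in UTF-8

# Padding the encoded form counts code units, so only 3 spaces are added
# to reach a width of 5 bytes.
padded_bytes = encoded.rjust(5).decode("utf-8")

# Padding the text form counts code points, so 4 spaces are added.
padded_text = text.rjust(5)

print(len(padded_bytes), len(padded_text))
```

The byte-level padding comes up one column short once decoded, which is the
sense in which code points behave more like characters than code units do.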
Sure, this doesn't work for full unicode, and it doesn't work for all
protocols used with sockets, but claiming that only fast and loose
code ever uses str to represent text is quite far from reality -- this
would be saying that the locale module is only for quick and dirty
code, which just ain't so.
It would definitely be overreaching to say all code that uses str is
quick and dirty. But I do think that it fits into one of two
categories: quick and dirty, or legacy. locale is an example of a
legacy case for which there is no replacement (that I'm aware of). Even
if I were writing a totally unicode-clean application, as far as I'm
aware, there's no common replacement for e.g. locale.currency().
Still, locale is limiting. It's ... uncomfortable to call
locale.currency() in a multi-user server process. It would be nice if
there were a replacement that completely separated encoding issues from
localization issues.
I believe that a constraint should be that by default (without -3 or a
__future__ import) str and bytes should be the same thing. Or, another
way of looking at this, reads from binary files and reads from sockets
(and other similar things, like ctypes and mmap and the struct module,
for example) should return str instances, not instances of a str
subclass by default -- IMO returning a subclass is bound to break too
much code. (Remember that there is still *lots* of code out there that
uses "type(x) is types.StringType" rather than "isinstance(x, str)",
and while I'd be happy to warn about that in -3 mode if we could, I
think it's unacceptable to break that in the default environment --
let it break in 3.0 instead.)
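
A minimal sketch of the breakage described above, assuming a hypothetical str
subclass (SocketBytes is invented here, not a proposed API). It uses Python 3
spelling so it runs as-is; "type(x) is str" stands in for the 2.x idiom
"type(x) is types.StringType".

```python
class SocketBytes(str):
    """Hypothetical subclass marking data that was read from a socket."""


def is_string_exact(x):
    # Stand-in for the 2.x idiom: type(x) is types.StringType
    return type(x) is str


def is_string_instance(x):
    # The check that keeps working when a subclass is returned.
    return isinstance(x, str)


data = SocketBytes("GET / HTTP/1.0")

print(is_string_exact(data))     # exact-type check rejects the subclass
print(is_string_instance(data))  # isinstance check accepts it
```

Code using the exact-type check would silently stop treating such data as a
string, which is the kind of breakage being ruled out for the default
environment.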
I agree. But, it's precisely because this is so subtle that it would be
nice to have tools which would report warnings to help fix it.
*Certainly* by default, everywhere that's "str" in 2.5 should be "str"
in 2.6. Probably even in -3 mode, if the goal there is "warnings only".
However, the feature still strikes me as potentially useful while
porting. If I were going to advocate for it, though, it would be as a
separate option, e.g. "--separate-bytes-type". I want it separate from
just running the code on 3.0 to see what happens because this seems
like the most subtle and difficult aspect of the port to get right; it
would be nice to be able to tweak it individually, without the other
issues related to 3.0. For example, some of the code I work on has a
big stack of dependencies. Some of those are in C, and most of them
don't process text at all. Most of them aren't going to port to 3.0
very early, though, so it would be good to start running in as 3.0-like
an environment as possible before then, so that the hard stuff is done
by the time the full stack has been migrated.
I've written lots of code that
aggressively rejects str() instances as text, as well as unicode
instances as bytes, and that's in code that still supports 2.3 ;).
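
Roughly what that kind of checking looks like, as a sketch only. The helper
names require_text and require_bytes are hypothetical, and this uses Python 3
spelling (str for text, bytes for binary data) rather than the 2.x
unicode/str pair the real code deals with.

```python
def require_text(value):
    """Return value unchanged if it is text, otherwise raise TypeError."""
    if not isinstance(value, str):
        raise TypeError("expected text, got %s" % type(value).__name__)
    return value


def require_bytes(value):
    """Return value unchanged if it is binary data, otherwise raise TypeError."""
    if not isinstance(value, bytes):
        raise TypeError("expected bytes, got %s" % type(value).__name__)
    return value


require_text("hello")         # accepted
require_bytes(b"\x00\x01")    # accepted

try:
    require_text(b"hello")    # bytes passed where text is required
except TypeError as exc:
    print(exc)
```

The point is only that the two kinds of data never get mixed implicitly; any
crossing between them has to go through an explicit encode or decode.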
Yeah, well, but remember, while keeping you happy is high on my list
Thanks, good to hear :)
of priorities, it's not the only priority. :-)
I don't think it's even my fiancée's *only* priority, and I think it
should stay higher on her list than yours ;-).