On Jun 22, 2010, at 2:07 PM, James Y Knight wrote:

> Yeah. This is a real issue I have with the direction Python3 went: it pushes 
> you into decoding everything to unicode early, even when you don't care -- 
> all you really wanted to do is pass it from one API to another, with some 
> well-defined transformations, which don't actually depend on it having been 
> decoded properly. (For example, extracting the path from the URL and 
> attempting to open it as a file on the filesystem.)

But you _do_ need to decode it in this case.  If you got your URL from some 
funky UTF-32 data source, b"\x00\x00\x00/" is not a path separator; "/" is.  
Plus, you should really be splitting the path into segments and looking at them 
individually so that you don't fall victim to "%2F" bugs.  And if you want your 
code to be portable, you need a Unicode representation of your pathname anyway 
for Windows, where you also need to care about "\" as well as "/".
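
For concreteness, here's a rough sketch of what I mean (the function name is
made up, and it's an illustration rather than a complete URL parser): split on
the *character* "/" first, then percent-decode each segment, so an encoded
"%2F" inside a segment never gets mistaken for a real separator.

    from urllib.parse import unquote

    def url_path_segments(raw_path):
        # URL syntax is ASCII by definition, so decode the wire bytes
        # explicitly; anything non-ASCII here is already an error worth
        # surfacing rather than papering over.
        path = raw_path.decode('ascii')
        # Split before unquoting, so "a%2Fb" stays one segment named "a/b"
        # instead of becoming two segments.
        return [unquote(segment) for segment in path.split('/') if segment]

    print(url_path_segments(b'/docs/a%2Fb/notes.txt'))
    # ['docs', 'a/b', 'notes.txt']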

The fact that your wire bytes were probably ASCII(-ish), and that your 
filesystem probably encodes pathnames as UTF-8, so everything looks like it 
lines up, is no excuse not to be explicit about your expectations there.
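
To make the "it happens to line up" trap concrete (toy values, obviously): the
same characters become different bytes under different encodings, which is
exactly why the encoding at each boundary should be named rather than assumed.

    path = 'café/menu.txt'
    print(path.encode('utf-8'))    # b'caf\xc3\xa9/menu.txt' (what a UTF-8 filesystem expects)
    print(path.encode('latin-1'))  # b'caf\xe9/menu.txt'     (a different coincidence entirely)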

You may want to transcode your characters into some other characters later, but 
that shouldn't stop you from treating them as characters of some variety in the 
meanwhile.

> The surrogateescape method is a nice workaround for this, but I can't help 
> thinking that it might've been better to just treat stuff as 
> possibly-invalid-but-probably-utf8 byte-strings from input, through 
> processing, to output. It seems kinda too late for that, though: next time 
> someone designs a language, they can try that. :)
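
(For reference, a toy example of how that workaround behaves: bytes that don't
decode survive the trip through str as lone surrogates instead of raising, and
come back out unchanged.)

    raw = b'caf\xe9/file'  # a latin-1 byte in something claiming to be UTF-8
    text = raw.decode('utf-8', 'surrogateescape')
    print(repr(text))                                      # 'caf\udce9/file'
    print(text.encode('utf-8', 'surrogateescape') == raw)  # True: bytes round-trip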

I can think of lots of optimizations that might be interesting for Python (or 
perhaps some other runtime less concerned with cleverness overload, like PyPy) 
to implement, like a UTF-8 combining-characters overlay that would allow for 
fast indexing, lazily populated as random access dictates.  But this could all 
be implemented as smartness inside .encode() and .decode() and the str and 
bytes types without changing the way the API works.
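
Purely to illustrate the shape of that idea (the name and structure here are
invented, and a real version would have to deal with combining characters, not
just codepoints), a pure-Python sketch might look like this:

    class Utf8Overlay:
        def __init__(self, data):
            self._data = data        # text kept as UTF-8 bytes
            self._offsets = None     # codepoint index, built lazily

        def _index(self):
            if self._offsets is None:
                # Continuation bytes look like 0b10xxxxxx; every other byte
                # starts a codepoint, so record those byte offsets.
                self._offsets = [i for i, b in enumerate(self._data)
                                 if (b & 0xC0) != 0x80]
            return self._offsets

        def __len__(self):
            return len(self._index())

        def __getitem__(self, i):
            offsets = self._index()
            start = offsets[i]
            end = offsets[i + 1] if i + 1 < len(offsets) else len(self._data)
            return self._data[start:end].decode('utf-8')

    s = Utf8Overlay('naïve café'.encode('utf-8'))
    print(len(s), s[2], s[9])   # 10 ï é

The index only gets paid for when somebody actually does random access, which
is the whole point of doing it lazily.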

I realize that there are implications at the C level, but as long as you can 
squeeze a function call in at certain places, it could still work.

I can also appreciate what's been said in this thread a bunch of times: to my 
knowledge, nobody has actually shown a profile of an application where encoding 
is a significant overhead.  I believe that encoding _will_ be a significant 
overhead for some applications (and actually I think it will be very 
significant for some applications that I work on), but optimizations should 
really be implemented only once that's been demonstrated, so that there's a 
better understanding of what the overhead is, exactly.  Is memory a big deal?  
Is CPU?  Is it both?  Do you want to tune for the trade-off?  Etc., etc.  
Clever data structures seem premature until someone has a good idea of all 
those things.
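
The measurement doesn't even have to be elaborate to be informative; even
something as crude as this (made-up payload, obviously) says more than
intuition does:

    import sys
    import timeit

    payload = ('hëllo wörld ' * 1000).encode('utf-8')
    text = payload.decode('utf-8')

    decode_time = timeit.timeit(lambda: payload.decode('utf-8'), number=10000)
    encode_time = timeit.timeit(lambda: text.encode('utf-8'), number=10000)

    print('decode: %.3fs  encode: %.3fs' % (decode_time, encode_time))
    print('bytes: %d B  str: %d B' % (sys.getsizeof(payload), sys.getsizeof(text)))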
