Re: [Python-Dev] bytes / unicode

Glyph Lefkowitz Tue, 22 Jun 2010 00:01:49 -0700

On Jun 21, 2010, at 10:58 PM, Stephen J. Turnbull wrote:

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.



No, *blue* is the best color for a shed.

Oops, wait, let me try that again.

While I broadly agree with this statement, it is really an oversimplification.  
An URI is a structured object, with many different parts, which are transformed 
from bytes to ASCII (or something latin1-ish, which is really just bytes with a 
nice face on them) to real, honest-to-goodness text via the IRI specification: 
<http://tools.ietf.org/html/rfc3987>.

> Note also that the "complete solution" argument cuts both ways.  Eg, a
> "complete" solution should implement UTS 39 "confusables detection"[1]
> and IDNA[2].  Good luck doing that with bytes!

And good luck doing that with just characters, too.  You need a parsed 
representation of the URI that you can encode different parts of in different 
ways.  (My understanding is that you should only really implement confusables 
detection in the netloc... while that may be a bogus example, you're certainly 
only supposed to do IDNA in the netloc!)

You can just call urlsplit() all over the place to emulate this, but this does 
not give you the ability to go back to the original bytes, and thereby preserve 
things like brokenly-encoded segments, which seems to be what a lot of this 
hand-wringing is about.

To put it another way, there is no possible information-preserving string or 
bytes type that will make everyone happy as a result from urljoin().  The only 
return-type that gives you *everything* is "URI".

> just using 'latin-1' as the encoding allows you to
> use the (unicode) string operations internally, and then spew your
> mess out into the world for someone else to clean up, just as using
> bytes would.

This is the limitation that everyone seems to keep dancing around.  If you are 
using the stdlib, with functions that operate on sequences like 'str' or 
'bytes', you need to choose from one of three options:

  1. "decode" everything to latin1 (although I prefer to call it "charmap" when 
used in this way) so that you can have some mojibake that will fool a function 
that needs a unicode object, but not lose any information about your input so 
that it can be transformed back into exact bytes (and be very careful to never 
pass it somewhere that it will interact with real text!),
  2. actually decode things to an appropriate encoding to be displayed to the 
user and manipulated with proper text-manipulation tools, and throw away 
information about the bytes,
  3. keep both the bytes and the characters together (perhaps in a data 
structure) so that you can both display the data and encode it in 
situationally-appropriate ways.

The stdlib as it is today is not going to handle the 3rd case for anyone.  I 
think that's fine; it is not the stdlib's job to solve everyone's problems.  
I've been happy with it providing correctly-functioning pieces that can be used 
to build more elaborate solutions.  This is what I meant when I said I agree 
with Stephen's first point: the stdlib *should* just keep operating entirely on 
strings, because URIs are defined, by the spec, to be sequences of ASCII 
characters.  But that's not the whole story.

PJE's "bstr" and "ebytes" proposals set my teeth on edge.  I can totally 
understand the motivation for them, but I think it would be a big step 
backwards for python 3 to succumb to that temptation, even in the form of a 
third-party library.  It is really trying to cram more information into a pile 
of bytes than truly exists there.  (Also, if we're going to have encodings 
attached to bytes objects, I would very much like to add "JPEG" and "FLAC" to 
the list of possibilities.)

The real tension there is that WSGI is desperately trying to avoid defining any 
data structures (i.e. classes), while still trying to work with structured 
data.  An URI class with a 'child' method could handily solve this problem.  
You could happily call IRI(...).join(some bytes).join(some text) and then just 
say "give me some bytes, it's time to put this on the network", or "give me 
some characters, I have to show something to the user", or even "give me some 
characters appropriate for an 'href=' target in some HTML I'm generating" - 
although that last one could be left to the HTML generator, provided it could 
get enough information from the URI/IRI object's various parts itself.

I don't mean to pick on WSGI, either.  This is a common pain-point for porting 
software to 3.x - you had a string, it kinda worked most of the time before, 
but now you need to keep track of text too and the functions which seemed to 
work on bytes no longer do.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] bytes / unicode

Reply via email to