Re: [Python-Dev] bytes / unicode

Stephen J. Turnbull Mon, 21 Jun 2010 21:53:00 -0700

Robert Collins writes:

 > Perhaps you mean 3986 ? :)


Thank you for the correction.

 > >    A URI is an identifier consisting of a sequence of characters
 > >    matching the syntax rule named <URI> in Section 3.
 > >
 > > (where the phrase "sequence of characters" appears in all ancestors I
 > > found back to RFC 1738), and
 > 
 > Sure, ok, let me unpack what I meant just a little. An abstract URI is
 > neither unicode nor bytes per se - see section 1.2.1 " A URI is a
 > sequence of characters from a very limited set: the letters of the
 > basic Latin alphabet, digits, and a few special characters. "

My position is that this describes the network protocol, not the
abstract URI.  It in no way suggests that uri-encoded forms should be
handled internally.  And the RFC explicitly says this is text, and
therefore sanctions the user- and programmer-friendly practice of
doing internal processing as text.

Note that in a hypothetical bytes-oriented API

    base = convert_uri_to_wire_format('http://www.example.org/')
    formuri = uri_join(base, b'/home/steve/public_html')

the bytes literal b'/home/steve/public_html' clearly is intended as
readable text.  This is mixing types in the programmer's mind, even
though base is internally in bytes format and the relative URI is also
in bytes format.  This is un-Pythonic IMO.
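
For contrast, here is a minimal sketch of the text-oriented style.  It
assumes Python 3's urllib.parse.urljoin as a stand-in for the
hypothetical uri_join above; the variable names are illustrative only.
Everything stays a str until it has to go on the wire.

```python
from urllib.parse import urljoin

# Both arguments are ordinary text; no bytes literal standing in for readable text.
base = 'http://www.example.org/'
formuri = urljoin(base, '/home/steve/public_html')

# Encoding happens once, at the boundary. Only ASCII is involved in this case.
wire = formuri.encode('ascii')
```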

 > URI interpretation is fairly strictly separated between producers and
 > consumers. A consumer can manipulate a url with other url fragments -
 > e.g. doing urljoin. But it needs to keep the url as a url and not try
 > to decode it to a unicode representation.

Unfortunately, outside of Kansas and Canberra, it don't work that
way.  How do you propose to uri_join base as above and
'/home/スティーブ/public_html'?  Encoding and/or decoding must be done
somewhere, and it would be damn unfriendly to make the browser user do
it!
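
One way this could look in a text-internal design, as a sketch only.
It assumes Python 3's urllib.parse and assumes the server wants UTF-8,
which is exactly the out-of-band knowledge in question; to_wire is a
hypothetical helper, not an existing API.

```python
from urllib.parse import quote, unquote, urljoin

base = 'http://www.example.org/'
path = '/home/スティーブ/public_html'

# Join as text first; the result is still readable by a human.
joined = urljoin(base, path)

def to_wire(uri_text, encoding='utf-8'):
    """Hypothetical helper: percent-encode text into ASCII wire bytes."""
    # Keep ':' and '/' as delimiters; escape everything outside the unreserved set.
    return quote(uri_text, safe=':/', encoding=encoding).encode('ascii')

# Assumption: UTF-8 is the encoding the server expects.
wire = to_wire(joined)

# Decoding with the same encoding recovers the original text.
round_trip = unquote(wire.decode('ascii'), encoding='utf-8')
```

The encoding decision is made once, in one place, by the code that has
the best chance of knowing the answer.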

In the bytes-oriented API, the programmer must be continually making
decisions about whether and how to handle non-ASCII components from
"outside" (or, more likely, cursing the existence of the damned
foreigners, and then ignoring the possibility ... let them eat
UnicodeError!)

 > As an example, if I give the uri "http://server/%c3%83", rendering
 > that as http://server/Ã is able to lead to transcription errors and
 > reinterpretation problems unless you know - out of band - that the
 > server is using utf8 to encode. Conversely if someone enters in
 > http://server/Ã in their browser window, choosing utf8 or their local
 > encoding is quite arbitrary and able to not match how the server would
 > represent that resource.

Sure.  Using bytes doesn't solve either problem.  It just allows you
to wash your hands of it and pass it on to someone else, who probably
has even less information than you do.

Eg, passing the uri "http://server/%c3%83" to someone else without
telling them the encoding means that effectively they're limited to
ASCII if they want to append meaningful relative paths without
guessing the encoding.

In the case of the user entering "http://server/Ã", you have to do
*something* to produce bytes eventually.  When was the last time you
typed "%c3%83" at the end of a URL in a browser address field?
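
The ambiguity in both directions can be illustrated with a short
sketch, assuming Python 3's urllib.parse.quote and unquote.  The two
encodings compared are just examples; nothing here says which one a
given server uses.

```python
from urllib.parse import quote, unquote

escaped = '%c3%83'

# The same octets decode to different text under different assumptions.
as_utf8 = unquote(escaped, encoding='utf-8')      # one character, U+00C3
as_latin1 = unquote(escaped, encoding='latin-1')  # two characters, U+00C3 U+0083

# And the same typed character produces different octets.
typed = '\u00c3'
utf8_form = quote(typed, encoding='utf-8')
latin1_form = quote(typed, encoding='latin-1')
```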

 > >    2.  Characters
 > >
 > >    The URI syntax provides a method of encoding data, presumably for
 > >    the sake of identifying a resource, as a sequence of characters.
 > >    The URI characters are, in turn, frequently encoded as octets for
 > >    transport or presentation.  This specification does not mandate any
 > >    particular character encoding for mapping between URI characters
 > >    and the octets used to store or transmit those characters.  When a
 > >    URI appears in a protocol element, the character encoding is
 > >    defined by that protocol; without such a definition, a URI is
 > >    assumed to be in the same character encoding as the surrounding
 > >    text.
 > 
 > Thats true, but its been taken out of context; the set of characters
 > permitted in a URL is a strict subset of characters found in  ASCII;

No.  Again, you're confounding "the URL" with its network format.
There's no question that the network format is in bytes, and before
putting the URI into a wire protocol, you need to encode non-URI
characters.  However, the abstract URI is text, and may not even be
represented by octets or Unicode at all (eg, represented by carbon
residue on recycled wood pulp).

 > See also the section on comparing URL's - Unicode isn't at all relevant.

Not to the RFC, which talks about *characters* and gives examples that
imply transcoding (eg, between EBCDIC and UTF-16), see the section you
cite.  However, Unicode is the canonical representation of text inside
Python, and therefore TOOWTDI for URL comparison in Python.

Thank you for that killer argument for my position; I hadn't thought
of it.

 > I wish it would. The problem is not in Python here though - and
 > casually handwaving will exacerbate it, not fix it. 

Using bytes "because we just don't know" is exactly casual handwaving.
Well, maybe not casual; I'm aware that many programmers are driven to
it by the recognition that only the extremes (all bytes vs. all text)
make sense, and they choose bytes for efficiency reasons.

I believe that focus on efficiency is un-Pythonic; that in Python 3
text should be chosen (in the stdlib) because it makes writing
programs more fun (you can use literal notation for non-ASCII string
constants, for example) and debuggable.

Sure, in some cases you'll need to punt to 'latin-1' (ie, 'binary') or
perhaps PEP 383 lone surrogates (this would require special handling
to get reasonably friendly presentation to users and debuggers, I
suppose), but for the many cases where you know that everything is in
the same encoding life is a lot better.  And of course I have no
objection to an additional API for efficiency for those who want it,
and maybe that even belongs in the stdlib.  But IMO the TOOWTDI should
use text (ie, Python 3 str = Unicode) by default.
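
A small sketch of those two fallbacks, assuming Python 3's built-in
codecs and the 'surrogateescape' error handler from PEP 383.  The
sample path and the choice of Shift JIS as the "unknown" encoding are
illustrative assumptions.

```python
# Bytes from "outside" in an encoding we were not told about.
raw = '/home/スティーブ/public_html'.encode('shift_jis')

# 'latin-1' maps every byte to one code point, so it never fails and round-trips.
as_latin1 = raw.decode('latin-1')
back_from_latin1 = as_latin1.encode('latin-1')

# PEP 383: undecodable bytes become lone surrogates that encode back unchanged.
as_escaped = raw.decode('utf-8', errors='surrogateescape')
back_from_escaped = as_escaped.encode('utf-8', errors='surrogateescape')
```

Neither form is readable text, but both preserve the original octets,
so the data survives until someone who knows the encoding can decode
it properly.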

 > Modelling URL's as string like things is great from a convenience
 > perspective, but, like file paths, they are much more complex
 > difficult.

No.  Like file paths, it is the key to any real solution to the
problem.  Users, whether server admins, URN specifiers, or browsers,
think about the URI as text and expect inputting text to work.  As
does the RFC.  Machines, on the other hand, think of both as bytes (at
least in the general Unix world).  It is the programmer's job to do
the best she can to identify the correct encoding to bridge the
mismatch.  She can abdicate that job, of course, but if she chooses
*not* to abdicate, treating the URI as text (1) encourages her to
confront the issue early, and (2) ensures that to the extent possible
the URI will maintain its quality of intelligible text.

With bytes, your only sane choice is to abdicate.

N.B.  STD 66 refrains from redefining HTTP URLs to be UTF-8 because
*it would not work*.  Practically, Nippon Tel & Tel will continue to
use Shift JIS URIs for cellphone-oriented sites because its handset
browsers only understand Shift JIS (or some such nonsense).

 > If Unicode was relevant to HTTP,

Again, Unicode is relevant not because of the wire protocols, but
because of Python's internal representation of text and because of
the intent of the RFCs.

 > I'd agree, but its not; we should put fragile heuristics at the
 > outer layer of the API and work as robustly and mechanically as
 > possible at the core. Where we need to guess, we need worker
 > functions that won't guess at all - for the sanity of folk writing
 > servers and protocol implementations.

A worker function that doesn't guess must error in the absence of
out-of-band information about the encoding.  This is true whether you
represent URIs internally as bytes or as text.  Refusing to error
constitutes a guess, because in a bytes-internal system, eventually
text from outside will find its way into the system, and must be
encoded to bytes, and in the case of a text-internal system, obviously
bytes from outside are coming in and must be decoded to text.
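
As a sketch of what a non-guessing worker might look like on the
text-internal side: uri_bytes_to_text is a hypothetical name, and the
only library call assumed is Python 3's urllib.parse.unquote_to_bytes.

```python
from urllib.parse import unquote_to_bytes

def uri_bytes_to_text(raw, encoding=None):
    """Hypothetical worker: decode wire bytes to text without guessing."""
    if encoding is None:
        # No out-of-band information: raising is the only non-guess.
        raise ValueError('encoding of the URI octets is unknown')
    # Strict decoding, so a wrong encoding claim surfaces as an error.
    return unquote_to_bytes(raw).decode(encoding, errors='strict')
```

A caller that wants heuristics can wrap this at the outer layer; the
worker itself either has the information or fails.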

