On Tue, Jun 22, 2010 at 11:58:57AM +0900, Stephen J. Turnbull wrote:
> Toshio Kuratomi writes:
> 
>  > One comment here -- you can also have uri's that aren't decodable
>  > into their true textual meaning using a single encoding.
>  > 
>  > Apache will happily serve out uris that have utf-8, shift-jis, and
>  > euc-jp components inside of their path but the textual
>  > representation that was intended will be garbled (or be represented
>  > by escaped byte sequences).  For that matter, apache will serve
>  > requests that have no true textual representation as it is working
>  > on the byte level rather than the character level.
> 
> Sure.  I've never seen that combination, but I have seen Shift JIS and
> KOI8-R in the same path.
> 
> But in that case, just using 'latin-1' as the encoding allows you to
> use the (unicode) string operations internally, and then spew your
> mess out into the world for someone else to clean up, just as using
> bytes would.
> 
This is true.  I'm giving this as a real-world counterexample to the
assertion that URIs are "text".  In fact, I think you're confusing things
a little by asserting that the RFC says that URIs are text.  I'll address
that two sections down.

>  > So a complete solution really should allow the programmer to pass
>  > in uris as bytes when the programmer knows that they need it.
> 
> Other than passing bytes into a constructor, I would argue if a
> complete solution requires, eg, an interface that allows
> urljoin(base,subdir) where the types of base and subdir are not
> required to match, then it doesn't belong in the stdlib.  For stdlib
> usage, that's premature optimization IMO.
> 
I'll definitely buy that.  Would urljoin(b_base, b_subdir) => bytes and
urljoin(u_base, u_subdir) => unicode be acceptable, though?  (I think, given
other options, I'd rather see two separate functions.  It seems more
discoverable, and less prone to accepting bad input some of the time, to
have two functions that each clearly take only one type of data.)
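
To make the two-function idea concrete, here's a sketch.  The names are
illustrative, not a proposal for the stdlib API; it leans on the fact that
current Python 3's urllib.parse.urljoin accepts either all-str or all-bytes
arguments:

```python
from urllib.parse import urljoin

def urljoin_str(base, url):
    """Join text URIs; reject bytes so mixed input fails fast."""
    if not (isinstance(base, str) and isinstance(url, str)):
        raise TypeError("urljoin_str only accepts str arguments")
    return urljoin(base, url)

def urljoin_bytes(base, url):
    """Join byte URIs; reject str for the same reason."""
    if not (isinstance(base, bytes) and isinstance(url, bytes)):
        raise TypeError("urljoin_bytes only accepts bytes arguments")
    return urljoin(base, url)
```

Each function accepts exactly one type, so a caller who mixes bytes and
unicode gets a TypeError at the boundary rather than garbage out the other
end.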

> The RFC says that URIs are text, and therefore they can (and IMO
> should) be operated on as text in the stdlib.

If I'm reading the RFC correctly, you're actually operating on two different
levels here.  Here's the section 2 that you quoted earlier, now in its
entirety::

2.  Characters

   The URI syntax provides a method of encoding data, presumably for the
   sake of identifying a resource, as a sequence of characters.  The URI
   characters are, in turn, frequently encoded as octets for transport or
   presentation.  This specification does not mandate any particular
   character encoding for mapping between URI characters and the octets used
   to store or transmit those characters.  When a URI appears in a protocol
   element, the character encoding is defined by that protocol; without such
   a definition, a URI is assumed to be in the same character encoding as
   the surrounding text.

   The ABNF notation defines its terminal values to be non-negative integers
   (codepoints) based on the US-ASCII coded character set [ASCII].  Because
   a URI is a sequence of characters, we must invert that relation in order
   to understand the URI syntax.  Therefore, the integer values used by the
   ABNF must be mapped back to their corresponding characters via US-ASCII
   in order to complete the syntax rules.

   A URI is composed from a limited set of characters consisting of digits,
   letters, and a few graphic symbols.  A reserved subset of those
   characters may be used to delimit syntax components within a URI while
   the remaining characters, including both the unreserved set and those
   reserved characters not acting as delimiters, define each component's
   identifying data.

So here's an example mapping those terms onto actual steps in the
process::

  # We start off with some arbitrary data that defines a resource.  This is
  # not necessarily text.  It's the data from the first sentence:
  data = b"\xff\xf0\xef\xe0"

  # We encode that into text and combine it with the scheme and host to form
  # a complete uri.  This is the "URI characters" mentioned in section #2.
  # It's also the "sequence of characters" mentioned in 1.1, as it is not
  # until this point that we actually have a URI.
  uri = b"http://host/" + percentencoded(data)
  # 
  # Note1: percentencoded() needs to take any bytes or characters outside of
  # the characters listed in section 2.3 (ALPHA / DIGIT / "-" / "." / "_"
  # / "~") and percent encode them.  The URI can only consist of characters
  # from this set and the reserved character set (2.2).
  #
  # Note2: in this simplistic example, we're only dealing with one piece of
  # data.  With multiple pieces, we'd need to combine them with separators,
  # for instance like this:
  # uri = (b'http://host/' + percentencoded(data1) + b'/'
  #        + percentencoded(data2))
  #
  # Note3: at this point, the uri could be stored as unicode or bytes in
  # python3.  It doesn't matter.  It will be a subset of ASCII in either
  # case.

  # Then we take this and encode it for presentation inside of a data
  # file.  If we're saving in an encoding that has ASCII as a subset and
  # the previous step gave us bytes, all we need to do is write them to
  # the file.  If the previous step gave us a str (say, u_uri), we need
  # to encode it to the target encoding first:
  u_uri.encode('utf-8')
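
The percentencoded() helper above is hypothetical; a minimal sketch of it in
Python 3 could use urllib.parse.quote_from_bytes (a real stdlib function),
encoding its str result back to ASCII bytes so it can be concatenated with
a bytes prefix:

```python
from urllib.parse import quote_from_bytes

def percentencoded(data):
    # safe='' so that even '/' inside the data gets escaped.
    # quote_from_bytes returns a str of ASCII characters; encode it back
    # to bytes so it can be joined with a bytes URI prefix.
    return quote_from_bytes(data, safe='').encode('ascii')

uri = b"http://host/" + percentencoded(b"\xff\xf0\xef\xe0")
# uri is now b"http://host/%FF%F0%EF%E0"
```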

With all this in mind... URIs are text according to the RFC if you want to
deal with URIs that are percent encoded.  In other words, things like this::
  http://host/%ff%f0%ef%e0

If you want to deal with things like this::
  http://host/café

Then you are going one step further, back to the original data that was
encoded per the RFC.  At that point you are no longer dealing with the
sequence of characters talked about in the RFC.  You are dealing with data
which may or may not be text.

As Robert Collins says, this is bytes by definition, which I pretty much
agree with.  It's very convenient to work with this data as text most of
the time, but the RFC does not mandate that it is text, so operating on it
as bytes is perfectly reasonable.
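
A concrete illustration, using urllib.parse.unquote_to_bytes (which exists
in Python 3 precisely because the decoded data is bytes, not text):

```python
from urllib.parse import unquote_to_bytes

# The path of http://host/%ff%f0%ef%e0 decodes to bytes that are not
# valid text in any single common encoding:
data = unquote_to_bytes('%ff%f0%ef%e0')
try:
    data.decode('utf-8')
except UnicodeDecodeError:
    pass  # 0xff can never occur in valid UTF-8

# Whereas the path of http://host/caf%c3%a9 decodes to valid UTF-8:
assert unquote_to_bytes('caf%c3%a9').decode('utf-8') == 'café'
```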

> It's not just a matter
> of manipulating the URIs themselves, where working directly on bytes
> will work just as well and and with the same string operations (as
> long as everything is bytes).  It's also a question of API complexity
> (eg, Barry's bugaboo of proliferation of encoding= parameters) and of
> debugging (if URIs are internally str, then they will display sanely
> in tracebacks and the interpreter).

The proliferation of encoding parameters, I agree, is ugly.  Although, if
I'm thinking correctly, that only matters when you want to allow mixing
bytes and unicode, correct?  One of these cases:

* I take in some mix of parameters with at least one unicode and output bytes
* I take in some mix of parameters with at least one bytes and output unicode
* I take in either bytes or unicode and transform them internally to the
  other type before operating on them.  Then I transform the output to the
  input type before returning.
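
The third case might look something like this hypothetical helper, which
normalizes a path while preserving the caller's type (assuming, for the
sketch, that bytes input is plain ASCII, which percent-encoded URIs are):

```python
def normalize_path(path):
    # Accept str or bytes; work internally on str; return the input type.
    was_bytes = isinstance(path, bytes)
    text = path.decode('ascii') if was_bytes else path
    # Drop '.' segments as a stand-in for some real normalization step.
    result = '/'.join(seg for seg in text.split('/') if seg != '.')
    return result.encode('ascii') if was_bytes else result
```

The internal transform is where the encoding= parameter would creep in;
'ascii' is safe here only because a percent-encoded URI is an ASCII subset.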

For debugging, I'm either not understanding or you're wrong.  If I'm given
an arbitrary sequence of bytes, how do I sanely store them as str internally?
If I transform them using an encoding that anticipates the full range of
bytes, I may be able to display some representation of them, but it's not
necessarily the sanest display (for instance, if I know that path element 1
is always a utf8-encoded string, path element 2 is always shift-jis encoded,
and path element 3 is binary data, I could construct a much saner display
than treating the whole thing as latin1).
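
Illustrating that with made-up data (the per-element encodings here are
assumptions of the example; nothing in the URI itself tells you them):

```python
# Hypothetical path: element 1 is UTF-8, element 2 is Shift-JIS,
# element 3 is raw binary data.
segments = ['café'.encode('utf-8'), '日本'.encode('shift_jis'), b'\x00\xff']

# Decoding the whole path as latin1 never fails, but garbles the display:
as_latin1 = b'/'.join(segments).decode('latin-1')

# Per-element knowledge gives a far saner display:
display = '/'.join([
    segments[0].decode('utf-8'),
    segments[1].decode('shift_jis'),
    segments[2].hex(),  # show the binary element as hex
])
# display == 'café/日本/00ff'
```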

> The cases where URIs can't be sanely treated as text are garbage
> input, and the stdlib should not try to provide a solution.  Just
> passing in bytes and getting out bytes is GIGO.  Trying to do "some"
> error-checking is going to be insufficient much of the time and overly
> strict most of the rest of the time.  The programmer in the trenches
> is going to need to decide what to allow and what not; I don't think
> there are general answers because we know that allowing random URLs on
> the web leads to various kinds of problems.  Some sites will need to
> address some of them.
> 
What is your basis for asserting that URIs that aren't sanely treated as
text are garbage?  It's definitely not in the RFC.

> Note also that the "complete solution" argument cuts both ways.  Eg, a
> "complete" solution should implement UTS 39 "confusables detection"[1]
> and IDNA[2].  Good luck doing that with bytes!
> 
Note that IDNA and confusables detection operate on a different portion of
the uri than the need for bytes.  Those operate on the domain name (looks
like it's called the authority in the rfc) whereas bytes are useful for the
path, query, and fragment portions.
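
For instance, Python's stdlib codec for IDNA (the 2003 flavor) operates on
the host as text, while the path can stay opaque percent-encoded bytes:

```python
# The authority (host) must be text for IDNA to apply:
host = 'bücher.example'
encoded_host = host.encode('idna')
# encoded_host == b'xn--bcher-kva.example'

# The path, by contrast, can remain percent-encoded bytes:
uri = b'http://' + encoded_host + b'/%ff%f0'
```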

Note:  I'm not sure precisely what Philip is looking to do, but the little
I've read sounds like it's contrary to the design principles of the python3
unicode handling redesign.  I'm stating my reading of the RFC not to defend
Philip's use case, but because I think the outlook that non-text uris
(before being percent-encoded) are violations of the RFC is wrong and, if
allowed to dominate the thinking, will lead to interoperability
problems/warts (since you could turn them into latin1, from there into
bytes, and from there into the proper values).

-Toshio
