On Sun, 2010-09-19 at 12:03 +1000, Nick Coghlan wrote:
> On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <na...@animats.com> wrote:
> > On 9/18/2010 2:29 AM, python-dev-requ...@python.org wrote:
> >>
> >> Polymorphic best practices [was: (Not) delaying the 3.2 release]
> >
> > If you're hung up on this, try writing the user-level documentation
> > first. Your target audience is a working-level Web programmer, not
> > someone who knows six programming languages and has a CS degree.
> > If the explanation is too complex, so is the design.
> >
> > Coding in this area is quite hard to do right. There are
> > issues with character set, HTML encoding, URL encoding, and
> > internationalized domain names. It's often done wrong;
> > I recently found a Google service which botched it.
> > Python libraries should strive to deliver textual data to the programmer
> > in clean Unicode. If someone needs the underlying wire representation
> > it should be available, but not the default.
>
> Even though URL byte sequences are defined as using only an ASCII
> subset, I'm currently inclined to add raw bytes support to
> urllib.parse by providing parallel APIs (i.e. urllib.parse.urlsplitb,
> etc.) rather than doing it implicitly in the normal functions.
>
> My rationale is as follows:
> - while URLs are *meant* to be encoded correctly as an ASCII subset,
> the real world isn't always quite so tidy (i.e. applications treat as
> URLs things that technically are not because the encoding is wrong)
> - separating the APIs forces the programmer to declare that they know
> they're working with the raw bytes off the wire to avoid the
> decode/encode overhead that comes with working in the Unicode domain
> - easier to change our minds later. Adding implicit bytes support to
> the normal names can be done any time, but removing it would require
> an extensive deprecation period
>
> Essentially, while I can see strong use cases for wanting to
> manipulate URLs in wire format, I *don't* see strong use cases for
> manipulating URLs without *knowing* whether they're in wire format
> (encoded bytes) or display format (Unicode text). For some APIs that
> work for arbitrary encodings (e.g. os.listdir) switching based on
> argument type seems like a reasonable idea. For those that may
> silently produce incorrect output for ASCII-incompatible encodings,
> the os.environ/os.environb split seems like a better approach.
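Before getting to my main point, for concreteness: the two existing stdlib patterns being contrasted above behave roughly like this today (a quick, untested sketch; os.environb is POSIX-only and new in 3.2):

    import os

    # Type-switching pattern (os.listdir): one name, output type follows
    # the argument type.
    os.listdir('.')     # list of str names
    os.listdir(b'.')    # list of bytes names

    # Parallel-API pattern (os.environ / os.environb): two names, each
    # with a fixed type.
    os.environ['PATH']      # str keys and values
    os.environb[b'PATH']    # bytes keys and values (POSIX only)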
urllib.parse.urlparse/urllib.parse.urlsplit will never need to decode anything when passed bytes input. Both could just put the bytes comprising the hex-encoded components (the path and query string) into their respective places in the parse results, just as they do now for string input. As far as I can tell, the only thing preventing them from working against bytes right now is the use of string literals in the source instead of input-type-dictated literals (see the sketch appended below). There should not really be any need to create a "urllib.parse.urlsplitb" unless the goal is to continue the (not great, IMO) precedent already set by the shadow bytes API in urllib.parse (*_to_bytes, *_from_bytes), or if we just want to make it deliberately harder to parse URLs.

The only decoding that needs to be done to potential bytes input by APIs in urllib.parse is in the face of percent encodings in the path and query components (handled entirely by "unquote" and "unquote_plus", which already deal in bytes under the hood). The only encoding that needs to be done by urllib.parse is in the face of input to the "urlencode" and "quote" APIs. "quote" already deals with bytes as input under the hood. "urlencode" does not, but it might be changed to use the same strategy that "quote" does now (by using a "urlencode_to_bytes" under the hood).

However, I think any thought about "adding raw bytes support" is largely moot at this point. This pool has already been peed in. There's effectively already a "shadow" bytes-only API in the urlparse module, in the form of the *_to_bytes and *_from_bytes functions, in most places where it counts.

So as I see it, the options are:

1) continue the *_to_bytes and *_from_bytes pattern as necessary.
2) create a new module (urllib.parse2) that has only polymorphic functions.

#1 is not very pleasant to think about as a web developer if I need to maintain a both-2-and-3-compatible codebase. Neither is #2, really, if I have to support both Python 3.1 and 3.2.

From my (obviously limited) perspective, a more attractive third option is a backwards incompatibility in a later Python 3 version, where encoding-aware functions like quote, urlencode, and unquote_plus become polymorphic, accepting both bytes and str objects and returning same-typed data.

- C
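P.S. By "input-type-dictated literals" I mean something like the toy sketch below. It is nothing like the real urlsplit code (toy_split and _literals_for are made-up names), just an illustration that once the constants match the input type, bytes in gives bytes out and str in gives str out with no decoding anywhere:

    def _literals_for(url):
        # Pick constants whose type matches the input; everything else in
        # the function can then stay type-agnostic.
        if isinstance(url, bytes):
            return b'#', b'?'
        return '#', '?'

    def toy_split(url):
        # Split off the fragment and the query using same-typed literals.
        frag_sep, query_sep = _literals_for(url)
        url, _, fragment = url.partition(frag_sep)
        url, _, query = url.partition(query_sep)
        return url, query, fragment

    >>> toy_split('http://example.com/a?b=1#c')
    ('http://example.com/a', 'b=1', 'c')
    >>> toy_split(b'http://example.com/a?b=1#c')
    (b'http://example.com/a', b'b=1', b'c')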