On Mon, Sep 20, 2010 at 10:12 PM, Chris McDonough <chr...@plope.com> wrote:
> urllib.parse.urlparse/urllib.parse.urlsplit will never need to decode
> anything when passed bytes input.

Correct. Supporting manipulation of bytes directly is primarily a speed hack for when an application wants to avoid the decoding/encoding overhead needed to perform the operations in the text domain when the fragments being manipulated are all already correctly encoded ASCII text.

However, supporting direct manipulation of bytes *implicitly* in the current functions is problematic, since it means that the function may fail silently when given bytes that are encoded with an ASCII-incompatible codec (of which there are many, especially when it comes to multibyte codecs and other stateful codecs). Even ASCII-compatible codecs are a potential source of hard-to-detect bugs, since using different encodings for different fragments will lead directly to mojibake.

Moving the raw bytes support out to separate APIs allows their constraints to be spelled out clearly and lets programmers make a conscious decision that that is what they want to do. The onus is then on the programmer to get their encodings correct.

If we decide to add implicit support later, that's pretty easy (just have urllib.parse.* delegate to urllib.parse.*b when given bytes). Taking implicit support *away* after providing it, however, means going through the whole deprecation song and dance. Given the choice, I prefer the API design that allows me to more easily change my mind later if I decide I made the wrong call.

> There's effectively already a "shadow" bytes-only API in the urlparse module
> in the form of the *_to_bytes and *_from_bytes functions in most places where
> it counts.

If by "most places where it counts" you mean "quote" and "unquote", then sure.
However, those two functions predate most of the work on fixing the bytes/unicode issues in the OS-facing libraries, so they blur the lines more than may be desirable (although reading http://bugs.python.org/issue3300 shows that there were a few other constraints in play when it comes to those two operations, especially those related to the encoding of the original URL *prior* to percent-encoding for transmission over the wire).

Regardless, quoteb and unquoteb will both be strictly bytes->bytes functions, whereas the existing quoting APIs attempt to deal with both text encoding and URL quoting all at the same time (and become a fair bit more complicated as a result).

Cheers,
Nick.

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
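
A minimal sketch of two points from the message above, not code from the thread. It assumes a Python 3 release (3.2 or later) in which urlsplit accepts bytes implicitly. The names quoteb and unquoteb are the ones proposed in the message and do not exist in the standard library; here they are hypothetical wrappers around the existing quote_from_bytes and unquote_to_bytes.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes, urlsplit

# 1. Silent failure with an ASCII-incompatible codec.
# UTF-16-LE interleaves NUL bytes with the ASCII characters. No exception
# is raised, but the scheme and host are no longer recognised.
text_url = "http://example.com/a"
good = urlsplit(text_url.encode("ascii"))
bad = urlsplit(text_url.encode("utf-16-le"))

print(good.scheme, good.netloc)
print(bad.scheme, bad.netloc)


# 2. Hypothetical strictly bytes->bytes quoting helpers (assumed names).
def quoteb(data: bytes) -> bytes:
    """Percent-encode raw bytes and return the result as bytes."""
    if not isinstance(data, bytes):
        raise TypeError("quoteb() requires bytes")
    # quote_from_bytes returns str; its output is always pure ASCII.
    return quote_from_bytes(data).encode("ascii")


def unquoteb(data: bytes) -> bytes:
    """Decode percent-escapes in bytes without assuming a text encoding."""
    if not isinstance(data, bytes):
        raise TypeError("unquoteb() requires bytes")
    return unquote_to_bytes(data)


raw = "café au lait".encode("utf-8")
quoted = quoteb(raw)
print(quoted)
print(unquoteb(quoted) == raw)
```

Neither helper decodes to text at any point, so the choice of encoding for the original data stays entirely with the caller, which is the trade-off the message describes.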