On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <na...@animats.com> wrote:
> On 9/18/2010 2:29 AM, python-dev-requ...@python.org wrote:
>>
>> Polymorphic best practices [was: (Not) delaying the 3.2 release]
>
> If you're hung up on this, try writing the user-level documentation
> first. Your target audience is a working-level Web programmer, not
> someone who knows six programming languages and has a CS degree.
> If the explanation is too complex, so is the design.
>
> Coding in this area is quite hard to do right. There are
> issues with character set, HTML encoding, URL encoding, and
> internationalized domain names. It's often done wrong;
> I recently found a Google service which botched it.
> Python libraries should strive to deliver textual data to the programmer
> in clean Unicode. If someone needs the underlying wire representation
> it should be available, but not the default.

Even though URL byte sequences are defined as using only an ASCII
subset, I'm currently inclined to add raw bytes support to
urllib.parse by providing parallel APIs (e.g. urllib.parse.urlsplitb)
rather than doing it implicitly in the normal functions.

My rationale is as follows:

- while URLs are *meant* to be encoded correctly as an ASCII subset,
  the real world isn't always quite so tidy (i.e. applications treat
  as URLs things that technically are not, because the encoding is
  wrong)
- separating the APIs forces the programmer to declare that they know
  they're working with the raw bytes off the wire, in order to avoid
  the decode/encode overhead that comes with working in the Unicode
  domain
- it's easier to change our minds later. Adding implicit bytes support
  to the normal names can be done at any time, but removing it would
  require an extensive deprecation period

Essentially, while I can see strong use cases for wanting to
manipulate URLs in wire format, I *don't* see strong use cases for
manipulating URLs without *knowing* whether they're in wire format
(encoded bytes) or display format (Unicode text).

For some APIs that work for arbitrary encodings (e.g. os.listdir),
switching based on argument type seems like a reasonable idea. For
those that may silently produce incorrect output for
ASCII-incompatible encodings, the os.environ/os.environb split seems
like a better approach.

I could probably be persuaded to merge the APIs, but the email6
precedent suggests to me that separating the APIs better reflects the
mental model we're trying to encourage in programmers manipulating
text (i.e. the difference between the raw octet sequence and the text
character sequence/parsed data).

Cheers,
Nick.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
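
For illustration only, here is a rough sketch of what the parallel-API
idea might look like. The name urlsplitb is just the hypothetical one
floated above and is not an existing urllib.parse function, and the
latin-1 round trip is an assumption made for this sketch, not a
proposed implementation:

```python
from urllib.parse import urlsplit


def urlsplitb(url):
    """Hypothetical bytes-only counterpart to urlsplit() (illustrative sketch)."""
    # Refuse text outright: calling this function is the caller's way of
    # declaring "I know I am holding raw bytes off the wire".
    if not isinstance(url, (bytes, bytearray)):
        raise TypeError("urlsplitb() expects bytes; use urlsplit() for str")
    # Assumption for this sketch: latin-1 maps each byte to exactly one
    # code point, so stray non-ASCII bytes pass through the text-based
    # splitter and can be encoded back to the same byte values.
    text = bytes(url).decode("latin-1")
    return tuple(part.encode("latin-1") for part in urlsplit(text))
```

With this shape, urlsplitb(b"http://example.com/caf\xe9?q=1") hands
back five bytes components, while passing a str raises TypeError
rather than silently guessing which domain the caller is in.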