On Thu, Jun 24, 2010 at 3:59 PM, Guido van Rossum <gu...@python.org> wrote:
> The protocol specs typically go out of their way to specify what byte
> values they use for syntactically significant positions (e.g. ':' in
> headers, or '/' in URLs), while hand-waving about the meaning of "what
> goes in between" since it is all typically treated as "not of
> syntactic significance". So you can write a parser that looks at bytes
> exclusively, and looks for a bunch of ASCII punctuation characters
> (e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
> in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
> "inside" stretches of characters between the special characters and
> just copies them. (Sometimes there may be *some* sections that are
> required to be ASCII and there equivalence of a-z and A-Z is well
> defined.)

Yes, these are the specific characters that I think we can handle
specially. For instance, the list of all string literals used by
urlsplit and urlunsplit:

'//'
'/'
':'
'?'
'#'
''
'http'
A list of all valid scheme characters (a-z etc)
Some lists for scheme-specific parsing (which all contain valid scheme
characters)

All of these are constrained to ASCII, and must be constrained to ASCII,
and everything else in a URL is treated as basically opaque.

So if we turned these characters into byte-or-str objects I think we'd
basically be true to the intent of the specs, and in a practical sense
we'd be able to make these functions polymorphic. I suspect this same
pattern will be present most places where people want polymorphic
behavior.

For now we could do something incomplete and just avoid using operators
we can't overload (is it possible to at least make them produce a
readable exception?).

I think we'll avoid a lot of the confusion that was present with Python 2
by not making the coercions transitive.
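As a rough sketch of what such a byte-or-str object could look like (this
is not an existing API): the class name `special` matches the name used in
the example further down, while its methods and internals are assumptions
made purely for illustration. It only overloads `+` and `==`, and the
result of a concatenation is a plain bytes or str, so the coercion does
not carry over to the next operation.

```python
class special:
    """Hypothetical ASCII-only literal that adopts the type of its neighbour."""

    def __init__(self, text):
        # Constrained to ASCII, so the str and bytes forms are unambiguous.
        self._text = text
        self._bytes = text.encode("ascii")  # UnicodeEncodeError if not ASCII

    def _as_type_of(self, other):
        # Pick the representation that matches the other operand, if any.
        if isinstance(other, bytes):
            return self._bytes
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        if isinstance(other, special):
            return special(self._text + other._text)
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else mine + other

    def __radd__(self, other):
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else other + mine

    def __eq__(self, other):
        if isinstance(other, special):
            return self._text == other._text
        mine = self._as_type_of(other)
        return NotImplemented if mine is None else mine == other

    def __hash__(self):
        return hash(self._text)


# The result takes the type of the non-special operand ...
bytes_url = b'http://example.com/foo' + special('?')   # plain bytes
text_url = 'http://example.com/foo' + special('?')     # plain str

# ... and, being a plain bytes object, refuses to mix with str afterwards.
try:
    bytes_url + 'bar=baz'
except TypeError as exc:
    print("not transitive:", exc)
```

Operators that cannot be overloaded from the literal's side are left out
of this sketch on purpose.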
For instance, here's something that would work in Python 2:

    urlunsplit(('http', 'example.com', '/foo', u'bar=baz', ''))

And you'd get out a unicode string, except that would break the first
time that query string (u'bar=baz') was not ASCII (but not until then!).

Here's the urlunsplit code:

    def urlunsplit(components):
        scheme, netloc, url, query, fragment = components
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/': url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return url

If all those literals were this new special kind of string, and you
called:

    urlunsplit((b'http', b'example.com', b'/foo', 'bar=baz', b''))

You'd end up constructing the URL b'http://example.com/foo' and then
running:

    url = url + special('?') + query

And that would fail, because b'http://example.com/foo' + special('?')
would be b'http://example.com/foo?', which is plain bytes, and you cannot
add that to the str 'bar=baz'. So we'd be avoiding the Python 2
craziness.

-- 
Ian Bicking | http://blog.ianbicking.org