Guido van Rossum wrote:
> IIRC I did it this way because the RFC about parsing urls specifically
> prescribed it had to be done this way.

That was true as of RFC 1808 (1995-1998), although the grammar actually allowed for a more generic interpretation.

Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular expression for parsing URI 'references' (a formal abstraction introduced in 2396) into 5 components (not six, since 'params' were moved into 'path' and eventually became an option on every path segment, not just the end of the path). The 5 components are: scheme, authority (formerly netloc), path, query, fragment.

Parsing could result in some components being undefined, which is distinct from being empty (e.g., 'mailto:[EMAIL PROTECTED]' would have an undefined authority and fragment, and a defined, but empty, query).

RFC 3986 / STD 66 (2005-) did not change the regular expression, but it makes several references to these '5 major components' of a URI, and says that these components are scheme-independent; parsers that operate at the generic syntax level "can parse any URI reference into its major components. Once the scheme is determined, further scheme-specific parsing can be performed on the components."

> You have to know what the scheme means before you can
> parse the rest -- there is (by design!) no standard parsing for
> anything that follows the scheme and the colon.

Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI references can be interpreted as having the 5 components; it was made explicit in RFC 3986 / STD 66.

> I don't even think
> that you can trust that if the colon is followed by two slashes that
> what follows is a netloc for all schemes.

You can.

> But if there's an RFC that says otherwise I'll gladly concede;
> urlparse's main goal in life is to be RFC compliant.

Its intent seems to be to split a URI into its major components, which are now by definition scheme-independent (and have been, implicitly, for a long time), so the function shouldn't distinguish between schemes.
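
To make the generic split concrete, here is a minimal sketch built on the regular expression given in Appendix B of RFC 3986. The function name split_uri_ref and the example.org URIs are made up for illustration; this is not urlparse's code, nor the 4Suite implementation.

```python
import re

# Regular expression from RFC 3986, Appendix B.
# Groups 2, 4, 5, 7 and 9 are scheme, authority, path, query, fragment.
URI_REF = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri_ref(uri_ref):
    """Hypothetical helper: split a URI reference into its 5 components.

    A component that is absent comes back as None; a component that is
    present but empty comes back as ''. The path is always a string.
    """
    # Every group in the pattern is optional, so match() always succeeds.
    m = URI_REF.match(uri_ref)
    return m.group(2, 4, 5, 7, 9)


if __name__ == '__main__':
    for ref in ('http://example.org/a;p=1/b?q#f',
                'mailto:user@example.org?',
                'file:///etc/hosts',
                'foo://bar/baz'):
        print(ref, '->', split_uri_ref(ref))
```

No scheme is consulted anywhere in the split, params stay inside the path they belong to, and a trailing '?' yields an empty query while its absence yields None.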

Do you want to keep returning that 6-tuple, or can we make it return a 5-tuple? If we keep returning 'params' for backward compatibility, then that means the 'path' we are returning is not the 'path' that people would expect (they'll have to concatenate path+params to get what the generic syntax calls a 'path' nowadays). It's also deceptive because params are now allowed on all path segments, and the current function only takes them from the last segment.

Also for backward compatibility, should an absent component continue to manifest in the result as an empty string? I think a compliant parser should make a distinction between absent and empty (it could make a difference, in theory). If a regular expression were used for parsing, it would produce None for absent components and empty-string for empty ones. I implemented it this way in 4Suite's Ft.Lib.Uri and it works nicely.

Mike
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev