OK, you've convinced me. But for backwards compatibility (until Python 3000), a new API should be designed. We can't change the old API in an incompatible way. Please submit complete code + docs to SF. (If you think this requires much design work, a PEP may be in order, but I think that given the new RFCs it's probably straightforward enough not to require that.)
--Guido

On 11/27/05, Mike Brown <[EMAIL PROTECTED]> wrote:
> Guido van Rossum wrote:
> > IIRC I did it this way because the RFC about parsing urls specifically
> > prescribed it had to be done this way.
>
> That was true as of RFC 1808 (1995-1998), although the grammar actually
> allowed for a more generic interpretation.
>
> Such an interpretation was suggested in RFC 2396 (1998-2004) via a regular
> expression for parsing URI 'references' (a formal abstraction introduced in
> 2396) into 5 components (not six, since 'params' were moved into 'path'
> and eventually became an option on every path segment, not just the end
> of the path). The 5 components are:
>
>   scheme, authority (formerly netloc), path, query, fragment.
>
> Parsing could result in some components being undefined, which is distinct
> from being empty (e.g., 'mailto:[EMAIL PROTECTED]' would have an undefined
> authority and fragment, and a defined, but empty, query).
>
> RFC 3986 / STD 66 (2005-) did not change the regular expression, but makes
> several references to these '5 major components' of a URI, and says that these
> components are scheme-independent; parsers that operate at the generic syntax
> level "can parse any URI reference into its major components. Once the scheme
> is determined, further scheme-specific parsing can be performed on the
> components."
>
> > You have to know what the scheme means before you can
> > parse the rest -- there is (by design!) no standard parsing for
> > anything that follows the scheme and the colon.
>
> Not since 1998, IMHO. It was implicit, at least since RFC 2396, that all URI
> references can be interpreted as having the 5 components, it was made explicit
> in RFC 3986 / STD 66.
>
> > I don't even think
> > that you can trust that if the colon is followed by two slashes that
> > what follows is a netloc for all schemes.
>
> You can.
>
> > But if there's an RFC that says otherwise I'll gladly concede;
> > urlparse's main goal in life is to be RFC compliant.
>
> Its intent seems to be to split a URI into its major components, which are now
> by definition scheme-independent (and have been, implicitly, for a long time),
> so the function shouldn't distinguish between schemes.
>
> Do you want to keep returning that 6-tuple, or can we make it return a
> 5-tuple? If we keep returning 'params' for backward compatibility, then that
> means the 'path' we are returning is not the 'path' that people would expect
> (they'll have to concatenate path+params to get what the generic syntax calls
> a 'path' nowadays). It's also deceptive because params are now allowed on all
> path segments, and the current function only takes them from the last segment.
>
> Also for backward compatibility, should an absent component continue to
> manifest in the result as an empty string? I think a compliant parser should
> make a distinction between absent and empty (it could make a difference, in
> theory).
>
> If a regular expression were used for parsing, it would produce None for
> absent components and empty-string for empty ones. I implemented it this
> way in 4Suite's Ft.Lib.Uri and it works nicely.
>
> Mike
>

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
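
For illustration, the regular-expression approach described in the quoted message can be sketched roughly as below. This is only a sketch under stated assumptions, not the code from 4Suite's Ft.Lib.Uri and not a proposed urlparse API: the name split_uri_ref and the example URIs are hypothetical, and the pattern is the one given in Appendix B of RFC 3986.

```python
import re

# Pattern from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 capture the
# scheme, authority, path, query and fragment respectively.
URI_REF = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def split_uri_ref(uri_ref):
    """Hypothetical helper: split a URI reference into a 5-tuple
    (scheme, authority, path, query, fragment).

    A component that is absent comes back as None; a component that is
    present but empty comes back as ''. The path is always defined.
    """
    match = URI_REF.match(uri_ref)
    return match.group(2, 4, 5, 7, 9)


# Absent authority and fragment, but a defined (empty) query.
print(split_uri_ref('mailto:user@example.org?'))

# Params stay inside the path, on whichever segments carry them.
print(split_uri_ref('http://example.org/a;x/b;y?q#f'))

# No scheme-specific knowledge is needed to find the authority.
print(split_uri_ref('foo://host'))
```

Because unmatched optional groups are reported as None, the absent/empty distinction falls out of the pattern itself; a caller that needs the old behaviour could map None to '' afterwards.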