On Tue, 2010-09-21 at 08:19 +1000, Nick Coghlan wrote:
> On Tue, Sep 21, 2010 at 7:39 AM, Chris McDonough <chr...@plope.com> wrote:
> > On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:
> >> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chr...@plope.com> wrote:
> >> > Existing APIs save for "quote" don't really need to deal with charset
> >> > encodings at all, at least on any level that Python needs to care about.
> >> > The potential already exists to emit garbage which will turn into
> >> > mojibake from almost all existing APIs. The only remaining issue seems
> >> > to be fear of making a design mistake while designing APIs.
> >> >
> >> > IMO, having a separate module for all urllib.parse APIs, each designed
> >> > for only bytes input is a design mistake greater than any mistake that
> >> > could be made by allowing for both bytes and str input to existing APIs
> >> > and returning whatever type was passed. The existence of such a module
> >> > will make it more difficult to maintain a codebase which straddles
> >> > Python 2 and Python 3.
> >>
> >> Failure to use quote/unquote correctly is a completely different
> >> problem from using bytes with an ASCII incompatible encoding, or
> >> mixing bytes with different encodings. Yes, if you don't quote your
> >> URLs you may end up with mojibake. That's not a justification for
> >> creating a *new* way to accidentally create mojibake.
> >
> > There's no new way to accidentally create new mojibake here by allowing
> > bytes input, as far as I can tell.
> >
> > - If a user passes something that has character data outside the range
> >   0-127 to an API that expects a URL or a "component" (in the
> >   definition that urllib.parse.urlparse uses for "component") of a URI,
> >   he can keep both pieces when it breaks. Whether that data is
> >   represented via bytes or text is not relevant. He provided
> >   bad input, he is going to lose one way or another.
> >
> > - If a user passes a bytestring to ``quote``, because ``quote`` is
> >   implemented in terms of ``quote_to_bytes`` the case is *already*
> >   handled by quote_to_bytes implicitly failing to convert nonascii
> >   characters.
> >
> > What are the cases you believe will cause new mojibake?
>
> Calling operations like urlsplit on byte sequences in non-ASCII
> compatible encodings and operations like urljoin on byte sequences
> that are encoded with different encodings. These errors differ from
> the URL escaping errors you cite, since they can produce true mojibake
> (i.e. a byte sequence without a single consistent encoding), rather
> than merely non-compliant URLs. However, if someone has let their
> encodings get that badly out of whack in URL manipulation they're
> probably doomed anyway...
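
To make the "byte sequence without a single consistent encoding" case concrete, here is a minimal sketch. The URLs and the choice of UTF-8 and Latin-1 are made up for illustration, and plain byte concatenation stands in for a join that does not check encodings; it is not meant to show what urljoin itself does.

```python
# Hypothetical inputs: the two encodings are chosen only for illustration.
base = "http://example.com/caf\u00e9/".encode("utf-8")   # path segment in UTF-8
relative = "na\u00efve".encode("latin-1")                # reference in Latin-1

# Naive byte-level join, standing in for a join that ignores encodings.
joined = base + relative


def decodes_as(data, encoding):
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The result is not valid UTF-8, because of the Latin-1 tail.
utf8_ok = decodes_as(joined, "utf-8")

# Latin-1 can decode any byte, but the UTF-8 part comes out garbled.
latin1_view = joined.decode("latin-1")

print(utf8_ok)
print(latin1_view)
```

Neither decoding recovers both halves of the joined value, which is the distinction being drawn from a URL that is merely unquoted.
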
Right, the bytes issue here is really a red herring in both the urlsplit
and urljoin cases, I think.

> It's certainly possible I hadn't given enough weight to the practical
> issues associated with migration of existing code from 2.x to 3.x
> (particularly with the precedent of some degree of polymorphism being
> set back when Issue 3300 was dealt with).
>
> Given that a separate API still places the onus on the developer to
> manage their encodings correctly, I'm beginning to lean back towards
> the idea of a polymorphic API rather than separate functions. (the
> quote/unquote legacy becomes somewhat unfortunate in that situation,
> as they always returns str objects rather than allowing the type of
> the result to be determined by the type of the argument. Something
> like quotep/unquotep may prove necessary in order to work around that
> situation and provide a bytes->bytes, str->str API)

Yay, sounds much, much better!

- C

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
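
For what a bytes->bytes, str->str pair could look like, here is a rough sketch. The names quotep and unquotep are taken from the suggestion quoted above and are hypothetical; nothing by those names exists in urllib.parse. The sketch only wraps quote, unquote and unquote_to_bytes, and assumes the default safe="/" is what the caller wants.

```python
from urllib.parse import quote, unquote, unquote_to_bytes


def quotep(value, safe="/"):
    """Hypothetical helper: percent-encode and return the argument's type."""
    # quote() accepts str or bytes but always returns str.
    quoted = quote(value, safe=safe)
    if isinstance(value, bytes):
        # Percent-encoded output is pure ASCII, so this cannot fail.
        return quoted.encode("ascii")
    return quoted


def unquotep(value):
    """Hypothetical helper: percent-decode and return the argument's type."""
    if isinstance(value, bytes):
        # Stay in bytes; no charset is assumed for the decoded octets.
        return unquote_to_bytes(value)
    # For str input, unquote() decodes the octets as UTF-8 by default.
    return unquote(value)


print(quotep("a b/\u00e9"))
print(quotep(b"a b/\xe9"))
print(unquotep(b"a%20b/%E9"))
```

The bytes path never guesses an encoding, so managing encodings stays with the caller, which is where the thread says that responsibility ends up either way.
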