Nick Coghlan <ncogh...@gmail.com> added the comment:

I've been pondering the idea of adopting a more conservative approach here, 
since there are actually two issues:

1. Properly quoted URLs are transferred as pure 7-bit ASCII (due to 
percent-encoding of everything else). However, most of the manipulation 
functions in urllib.parse can't handle bytes at all, even data that is 7-bit 
clean.

2. In the real world, just like email, URLs will often contain unescaped (or 
incorrectly escaped) characters, so assuming the input is actually pure ASCII 
isn't necessarily valid.
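
To make the two issues concrete, here is a small sketch. The path and the 
is_seven_bit_clean helper are made up for illustration; only 
urllib.parse.quote is a real library call. It shows that a properly quoted 
URL component is pure ASCII, while the same data left unescaped is not.

```python
from urllib.parse import quote


def is_seven_bit_clean(data: bytes) -> bool:
    """Hypothetical helper: True if every byte is in the ASCII range."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical path containing non-ASCII characters.
raw_path = "/caf\u00e9/men\u00fc"

# quote() percent-encodes the UTF-8 bytes of anything outside its safe set.
quoted = quote(raw_path)

print(quoted)                                        # /caf%C3%A9/men%C3%BC
print(is_seven_bit_clean(quoted.encode("ascii")))    # True
print(is_seven_bit_clean(raw_path.encode("utf-8")))  # False
```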

Since encoding (aside from quoting) isn't urllib.parse's problem, I'm 
wondering whether I should just handle bytes input via an implicit ASCII 
conversion in strict mode (and then convert back to bytes when the processing 
is complete).

Then bytes manipulation of properly quoted URLs will "just work", while 
improperly quoted URLs will fail noisily. This isn't like email or HTTP, where 
the protocol contains encoding information that the library should be trying to 
interpret - we're just being given raw bytes without any context information.
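
A minimal sketch of that behaviour, assuming a wrapper around the existing 
text-based urlsplit. The split_any name is hypothetical and this is not the 
actual patch; it only illustrates "decode as strict ASCII, process as text, 
encode back".

```python
from urllib.parse import urlsplit


def split_any(url):
    """Hypothetical helper: split a str or bytes URL, returning parts of
    the same type as the input."""
    if isinstance(url, str):
        return tuple(urlsplit(url))
    # Implicit strict ASCII conversion: raises UnicodeDecodeError on any
    # byte outside the 7-bit range.
    text = url.decode("ascii", "strict")
    # Convert each component back to bytes once processing is complete.
    return tuple(part.encode("ascii") for part in urlsplit(text))


# Properly quoted bytes input "just works".
print(split_any(b"http://example.com/a%20b?q=1#frag"))

# Improperly quoted bytes input fails noisily.
try:
    split_any(b"http://example.com/caf\xc3\xa9")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```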

If any application wants to be more permissive than that, it can do its own 
conversion to a string and then use the text-based processing. I'll add 
"encode" methods to the result objects to make it easy to convert their 
contents from str to bytes and vice-versa.
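
As a sketch of the permissive route: the application chooses its own decoding 
policy, uses the text-based API, and converts the result back to bytes. The 
input URL and the choice of UTF-8 are assumptions for illustration. The 
encode() and decode() calls on the result objects are the ones available in 
current Python 3 releases, which may differ in detail from what is proposed 
here.

```python
from urllib.parse import urlsplit

# Hypothetical improperly quoted input: raw UTF-8 bytes in the path.
raw = b"http://example.com/caf\xc3\xa9?q=1"

# The application, not urllib.parse, decides how to interpret the bytes.
text = raw.decode("utf-8")
parts = urlsplit(text)

# Convert the result's contents from str to bytes...
as_bytes = parts.encode("utf-8")
print(as_bytes.path)

# ...and back again.
round_trip = as_bytes.decode("utf-8")
print(round_trip == parts)
```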

I'll factor out the implicit encoding/decoding so that changing the model 
later (ASCII-strict, ASCII-escape, latin-1) shouldn't be too 
difficult.
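
One way that factoring might look, purely as a sketch. The policy table and 
the make_coercers name are hypothetical, and reading "ASCII-escape" as 
Python's surrogateescape error handler is my assumption, not something stated 
above.

```python
# Hypothetical mapping from model name to (encoding, error handler).
_POLICIES = {
    "ascii-strict": ("ascii", "strict"),
    "ascii-escape": ("ascii", "surrogateescape"),  # assumed meaning
    "latin-1": ("latin-1", "strict"),
}


def make_coercers(policy="ascii-strict"):
    """Return a (to_text, to_bytes) pair for the chosen model, so the
    parsing code never hard-codes an encoding."""
    encoding, errors = _POLICIES[policy]

    def to_text(data: bytes) -> str:
        return data.decode(encoding, errors)

    def to_bytes(text: str) -> bytes:
        return text.encode(encoding, errors)

    return to_text, to_bytes


sample = b"/caf\xc3\xa9"  # hypothetical non-ASCII input
for name in ("ascii-escape", "latin-1"):
    to_text, to_bytes = make_coercers(name)
    # Both permissive models round-trip arbitrary bytes unchanged.
    print(name, to_bytes(to_text(sample)) == sample)
```

Switching the model is then a one-line change to the default policy, with the 
strict variant still rejecting anything that is not 7-bit clean.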

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9873>
_______________________________________