On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman <[EMAIL PROTECTED]> wrote:
> On Sun, Dec 7, 2008 at 11:35, Adam Olsen <[EMAIL PROTECTED]> wrote:
>>>> http://bugs.python.org/issue3672
>>>> http://bugs.python.org/issue3297
>>
>> No. Unicode *requires* them to be treated as errors. If you want to
>> pass them through then you're creating a custom encoding... which you
>> might argue for in this case, but it needs to be clearly separate from
>> the real UTF-8.
>
> I suspect it is a common and convenient but (according to what you
> say) misconceived expectation that using UTF-8 to encode any Unicode
> string will not raise an exception. This behavior is not something
> which should be discarded lightly.

It is *not* a valid Unicode string in the first place. Therein lies
the problem.

> I see little reason that this couldn't be a new codec or error handler
> that allowed people to choose between correct pure UTF-8 behavior or
> the technically incorrect but very practical behavior it currently
> has.

Note that many of the restrictions were added for security reasons.
You might receive a UTF-8 encoded file name from a malicious user,
check the raw bytes for something dangerous (like
"../../../../../etc/passwd"), and only then decode it. If your decoder
isn't compliant (i.e. doesn't reject overlong sequences), then
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.

However, in this context we only need to allow lone surrogates.
CESU-8 comes to mind. (It is a perverse world we live in.)

-- 
Adam Olsen, aka Rhamphoryncus
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
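
To make the overlong-sequence point concrete, here is a minimal sketch. It assumes a current Python 3 interpreter, whose codec behavior differs from the versions under discussion in this thread. The names naive_decode, strict_rejects and payload are hypothetical, introduced only for illustration; naive_decode is a deliberately non-compliant toy decoder, not anything from the standard library.

```python
def naive_decode(data: bytes) -> str:
    """Toy decoder (hypothetical): handles 1- and 2-byte sequences only
    and, crucially, does NOT reject overlong forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF:
            # 2-byte lead; require a continuation byte, but never check
            # that the resulting code point is >= 0x80 (the missing test).
            if i + 1 >= len(data) or not 0x80 <= data[i + 1] <= 0xBF:
                raise ValueError("truncated or malformed 2-byte sequence")
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


def strict_rejects(data: bytes) -> bool:
    """Return True if the built-in UTF-8 codec refuses to decode data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Each '/' is smuggled in as the overlong pair C0 AF.
payload = b"..\xc0\xaf..\xc0\xafetc\xc0\xafpasswd"

# The byte-level check runs before decoding and sees no "../".
passed_check = b"../" not in payload
print(passed_check)             # True
print(naive_decode(payload))    # ../../etc/passwd
print(strict_rejects(payload))  # True

# Lone surrogates: strict UTF-8 refuses them on encode, while the
# "surrogatepass" error handler lets them through as three bytes.
lone = "\ud800"
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 refuses a lone surrogate")
print(lone.encode("utf-8", "surrogatepass"))  # b'\xed\xa0\x80'
```

The last two lines show one way the "separate codec or error handler" idea can look in practice: the strict codec stays strict, and passing lone surrogates through is an explicit opt-in.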