Re: [Python-Dev] Python-3.0, unicode, and os.environ

Adam Olsen Sun, 07 Dec 2008 10:56:44 -0800

On Sun, Dec 7, 2008 at 11:18 AM, Michael Urman <[EMAIL PROTECTED]> wrote:
> On Sun, Dec 7, 2008 at 11:35, Adam Olsen <[EMAIL PROTECTED]> wrote:
>>>> http://bugs.python.org/issue3672
>>>> http://bugs.python.org/issue3297
>>
>> No.  Unicode *requires* them to be treated as errors.  If you want to
>> pass them through then you're creating a custom encoding... which you
>> might argue for in this case, but it needs to be clearly separate from
>> the real UTF-8.
>
> I suspect it is a common and convenient but (according to what you
> say) misconceived expectation that using UTF-8 to encode any Unicode
> string will not raise an exception. This behavior is not something
> which should be discarded lightly.


It is *not* a valid Unicode string in the first place.  Therein lies
the problem.


> I see little reason that this couldn't be a new codec or error handler
> that allowed people to choose between correct pure UTF-8 behavior or
> the technically incorrect but very practical behavior it currently
> has.

Note that many of the restrictions were added for security reasons.
You might receive a UTF-8 encoded file name from a malicious user,
check whether it contains something dangerous (like
"../../../../../etc/passwd"), then decode it.  If your decoder isn't
compliant (i.e. doesn't reject overlong sequences) then a
b'\xC0\xAF' gets translated into u'/', bypassing your previous check.
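
A minimal sketch of that attack, for illustration only.  The function
naive_decode_2byte is a hypothetical, deliberately non-compliant decoder
(it is not part of Python or any library), and the payload is made up.
It shows a byte-level check for "/" passing, the lax decoder producing
slashes anyway, and Python's strict UTF-8 decoder rejecting the input.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Hypothetical lax decoder: ASCII plus 2-byte sequences, no overlong check."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte.
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Bug on purpose: a compliant decoder must reject lead bytes
            # 0xC0 and 0xC1, because they can only start overlong forms.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


# Made-up payload: every "/" is written as the overlong pair C0 AF.
payload = b"..\xc0\xaf..\xc0\xafetc"

# The pre-decoding security check sees no slash at the byte level.
byte_level_check_passes = b"/" not in payload

# The lax decoder turns each C0 AF into "/", defeating the check.
naive_result = naive_decode_2byte(payload)

# Python's built-in strict decoder refuses the overlong sequence.
try:
    payload.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(byte_level_check_passes, repr(naive_result), strict_rejects)
```

Here naive_result comes out as '../../etc' even though the byte-level
check passed, which is why a decoder that accepts C0 AF is unsafe.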

However, in this context we only need to allow lone surrogates.
CESU-8 comes to mind.  (It is a perverse world we live in.)
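
As a rough illustration of keeping lone surrogates "clearly separate
from the real UTF-8": the sketch below assumes a Python release newer
than the 3.0 under discussion (3.1 or later), where the codecs machinery
provides a "surrogatepass" error handler.  It is an opt-in handler, not
CESU-8, and the strict codec still treats the same data as an error.

```python
# A lone (unpaired) high surrogate: not a valid Unicode scalar value.
lone = "\ud800"

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_encode_fails = False
except UnicodeEncodeError:
    strict_encode_fails = True

# Opting in to "surrogatepass" emits the 3-byte form of the surrogate.
passed_through = lone.encode("utf-8", "surrogatepass")

# Strict UTF-8 also refuses to decode those bytes...
try:
    passed_through.decode("utf-8")
    strict_decode_fails = False
except UnicodeDecodeError:
    strict_decode_fails = True

# ...while the same opt-in handler round-trips them.
round_tripped = passed_through.decode("utf-8", "surrogatepass")

print(strict_encode_fails, passed_through, strict_decode_fails,
      round_tripped == lone)
```

The point of the design is that the permissive behavior has to be
requested by name, so code relying on strict UTF-8 for validation is
unaffected.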

-- 
Adam Olsen, aka Rhamphoryncus
