On 16/09/2010, Guido van Rossum <gu...@python.org> wrote:
> On Thu, Sep 16, 2010 at 11:16 AM, Toshio Kuratomi <a.bad...@gmail.com>
> wrote:
>> You were talking about encodings that were supersets of 7-bit ASCII.
>> I think Martin was demonstrating a byte string that was a superset of
>> 7-bit ASCII being fed to a stdlib function which went wrong.
>
> Whoops, sorry. I don't have access to Windows so I can't reproduce
> this though. I also don't understand it. What is the Unicode codepoint
> for that 十 character? What is sys.getfilesystemencoding()? What is the
> value of "C:\\十".encode(sys.getfilesystemencoding())?

My fault, I should have been clearer. I was trying to demonstrate that
there's a difference between the Unix-friendly encodings, like UTF-8
and the EUC codecs, which only use high-bit bytes for non-ASCII text,
and encodings like the ISO-2022 codecs and Shift JIS, which reuse
ASCII-range bytes inside their multibyte sequences.
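To make that distinction concrete, here is a rough Python 3 sketch that
checks, for a few codecs, whether the encoded form of 十 contains any byte
in the ASCII range. The helper name has_ascii_range_bytes is made up for
this illustration and is not part of the stdlib.

```python
# Hypothetical helper (not a stdlib function): report whether the encoded
# form of a non-ASCII character contains any byte below 0x80.
def has_ascii_range_bytes(char, encoding):
    return any(b < 0x80 for b in char.encode(encoding))

CHAR = "\u5341"  # the character 十

# Map each codec name to whether its encoding of CHAR reuses ASCII-range bytes.
results = {
    enc: has_ascii_range_bytes(CHAR, enc)
    for enc in ("utf-8", "euc_jp", "shift_jis", "cp932", "iso2022_jp")
}

for enc, reuses_ascii in results.items():
    print(enc, CHAR.encode(enc), reuses_ascii)
```

For this character, UTF-8 and EUC-JP produce only high-bit bytes, while
Shift JIS, CP932 and ISO-2022-JP all include bytes that byte-oriented code
could mistake for ASCII.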

In the example I gave, 十 (U+5341) encodes in CP932 as '\x8f\\', and the
function gets confused by the second byte, 0x5C, which is also the
ASCII backslash. Obviously the right answer there is just to use
Unicode, rather than write a function that works with weird multibyte
codecs.
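A minimal Python 3 sketch of the failure mode, assuming a naive byte-level
split on the backslash separator rather than the actual stdlib function
from earlier in the thread:

```python
path = "C:\\\u5341"            # the text path C:\十
encoded = path.encode("cp932")  # b'C:\\\x8f\\'

# Naive byte-level handling: treat every 0x5C byte as a path separator.
byte_parts = encoded.split(b"\\")

# Text-level handling: only the real separator matches.
text_parts = path.split("\\")

# The byte-level split cut the two-byte character in half, so the
# leftover lead byte is no longer valid CP932 on its own.
try:
    byte_parts[1].decode("cp932")
    stray_decodes = True
except UnicodeDecodeError:
    stray_decodes = False

print(byte_parts, text_parts, stray_decodes)
```

Splitting the str gives the two expected components, while splitting the
bytes yields a dangling lead byte and a spurious empty trailing component.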

Martin
