Re: [Python-Dev] PEP 383 update: utf8b is now the error handler

Stephen J. Turnbull Wed, 06 May 2009 19:29:19 -0700

"Martin v. Löwis" writes:

 > > Now, with Python's file system encoding == UTF-8 or any packed EUC,
 > > and more than a handful of Shift JIS or Big5 characters in file names,
 > > one is *almost certain* to encounter ASCII as the second byte of a
 > > multibyte sequence.  PEP 383 can't handle this


Ah, I see.  Of course, the algorithm not only has to handle the ASCII
octet which is erroneous because it can't be a trailing byte, but
*also the leading byte that signalled to expect a trailing byte >127*.
So the algorithm backs up to the character boundary (which is
well-defined for all the "sane" encodings), encodes the high byte(s) in
the character as lone surrogates, and encodes the ASCII byte as itself
(promoted to a Unicode code point).
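
To make that concrete, here is a minimal sketch of the behavior as I
understand it. It assumes a current Python 3, where the handler is
spelled "surrogateescape", and a UTF-8 file system encoding; the byte
strings are made-up examples, not data from this thread.

```python
# A Shift JIS file name: 0x83 0x5C is a single Shift JIS character
# whose second byte happens to be the ASCII backslash.
raw = b"\x83\x5c.txt"

# Decode as if the file system encoding were UTF-8 (assumption of this sketch).
name = raw.decode("utf-8", "surrogateescape")

# The undecodable high byte 0x83 becomes the lone surrogate U+DC83;
# the ASCII bytes are promoted to their own code points.
assert name == "\udc83\\.txt"

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", "surrogateescape")
assert restored == raw

# A truncated UTF-8 sequence: 0xE3 announces two trailing bytes >127,
# but only one (0x81) arrives before the ASCII 'A'.
truncated = b"\xe3\x81A"
truncated_name = truncated.decode("utf-8", "surrogateescape")

# Both high bytes are escaped, and the ASCII byte survives as itself.
assert truncated_name == "\udce3\udc81A"
assert truncated_name.encode("utf-8", "surrogateescape") == truncated
```

The point of the second case is that the leading byte is escaped too,
not just dropped, so the round trip stays lossless.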

Sorry, you're right, I was just confused.  I withdraw the objection as
completely mistaken, and apologize for not thinking more carefully in
the first place.