[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

John Machin Thu, 01 Apr 2010 07:43:31 -0700

John Machin <[email protected]> added the comment:

Patch review:


Preamble: pardon my ignorance of how the codebase works, but trunk 
unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k 
unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the 
r79542 that the patch mentions ... help, please!

length 2 case: 
1. the loop can be hand-unrolled into oblivion. It can be entered only when 
s[1] & 0xC0 != 0x80 (previous if test).
2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be 
removed and the entries for C0 and C1 in the utf8_code_length array set to 0.

length 3 case:
1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
2. the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the 
over-long test (ch < 0x800). It seems better to use the over-long test (with 
endinpos set to 1).
3. The test s[0] == 0xED relates to the surrogates caper which in the py3k 
version is handled in the same place as the over-long test.
4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know 
s[2] must be bad without testing it, because we start the for loop only when 
s[1] is bad || s[2] is bad.

length 4 case: as for the len 3 case generally ... misplaced tests, F1 test 
shadows over-long test, F4 test shadows max value test, too many loop 
iterations.

----------

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue8271>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Reply via email to