Martin v. Löwis wrote:
> Tony Nelson wrote:
>
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more forward
>> looking to use a table of 256 32-bit values, with no need for tricky
>> values.
>
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.

OK, here's a patch that implements this enhancement to
PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939

The mapping argument to PyUnicode_DecodeCharmap() can now be a unicode
string, which is then used as a decoding table.

Speed looks like this:

    python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
    1000 loops, best of 3: 538 usec per loop
    python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
    100 loops, best of 3: 3.85 msec per loop
    ./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
    1000 loops, best of 3: 539 usec per loop
    ./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
    1000 loops, best of 3: 623 usec per loop

Creating the decoding_map as a string should probably be done by
gencodec.py directly. This way the first import of the codec would be
faster too.

Bye,
   Walter Dörwald
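
For readers who want to see the table-lookup idea in isolation, here is a minimal pure-Python sketch. It is not the patch itself (that is C code inside PyUnicode_DecodeCharmap()), and it is written for Python 3 rather than the 2.4 interpreter used in the timings. The names decode_with_table, UNDEFINED and EXAMPLE_TABLE are hypothetical, and the choice of u"\ufffd" as the "maps to undefined" marker simply follows the suggestion quoted above; a real implementation may pick a different marker.

```python
# Assumption: U+FFFD marks "maps to undefined", as proposed in the quoted text.
UNDEFINED = "\ufffd"


def decode_with_table(data, table, errors="strict"):
    """Decode bytes by indexing each byte value into a string table.

    Hypothetical helper for illustration only. Byte values at or beyond
    len(table) are treated as undefined, like entries equal to UNDEFINED.
    """
    out = []
    for pos, byte in enumerate(data):
        ch = table[byte] if byte < len(table) else UNDEFINED
        if ch == UNDEFINED:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "charmap", bytes(data), pos, pos + 1,
                    "character maps to <undefined>")
            if errors == "ignore":
                continue
            if errors != "replace":
                raise ValueError("unsupported error handler: %r" % errors)
            ch = "\ufffd"  # emit REPLACEMENT CHARACTER for the bad byte
        out.append(ch)
    return "".join(out)


# Example table: ASCII plus the first two upper-half entries of mac-roman
# (0x80 -> U+00C4, 0x81 -> U+00C5). It is deliberately shorter than 256
# so that the remaining byte values fall into the "undefined" case.
EXAMPLE_TABLE = "".join(chr(i) for i in range(128)) + "\u00c4\u00c5"

if __name__ == "__main__":
    print(decode_with_table(b"ab\x80", EXAMPLE_TABLE))
```

The whole decode step is a single index operation per byte, which is presumably why a string table can be so much faster than a dictionary lookup per byte.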