Re: [Python-Dev] Unicode charmap decoders slow

Walter Dörwald Wed, 05 Oct 2005 08:08:10 -0700

Martin v. Löwis wrote:

> Tony Nelson wrote:
> 
>>> For decoding it should be sufficient to use a unicode string of
>>> length 256. u"\ufffd" could be used for "maps to undefined". Or the
>>> string might be shorter and byte values greater than the length of
>>> the string are treated as "maps to undefined" too.
>>
>> With Unicode using more than 64K codepoints now, it might be more forward
>> looking to use a table of 256 32-bit values, with no need for tricky
>> values.
> 
> You might be missing the point. \ufffd is REPLACEMENT CHARACTER,
> which would indicate that the byte with that index is really unused
> in that encoding.


OK, here's a patch that implements this enhancement to 
PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939

With the patch, the mapping argument to PyUnicode_DecodeCharmap() can be
a unicode string, which is then used as a decoding table indexed by byte
value.
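
To make the table lookup concrete, here is a minimal pure-Python sketch of
the semantics discussed in this thread. It is not the C code from the patch.
The function name decode_with_table and the toy table are made up for
illustration, and the code is written for Python 3 so it runs on a current
interpreter (in 2.4 the table would be a u"..." string).

```python
# Marker for "maps to undefined", as proposed earlier in the thread.
UNDEFINED = "\ufffd"


def decode_with_table(data, table, errors="strict"):
    """Decode bytes by indexing into a string table (hypothetical helper)."""
    out = []
    for pos, byte in enumerate(data):
        # Byte values past the end of the table count as undefined too.
        ch = table[byte] if byte < len(table) else UNDEFINED
        if ch == UNDEFINED:
            if errors == "strict":
                raise UnicodeDecodeError(
                    "charmap", data, pos, pos + 1,
                    "character maps to <undefined>")
            if errors == "ignore":
                continue
            if errors != "replace":
                raise LookupError("unknown error handler: %r" % (errors,))
            ch = "\ufffd"  # "replace": emit REPLACEMENT CHARACTER
        out.append(ch)
    return "".join(out)


# Toy table: identity for ASCII, plus byte 0x80 -> EURO SIGN.
# It is shorter than 256 entries, so 0x81..0xff are undefined.
table = "".join(chr(i) for i in range(128)) + "\u20ac"
decoded = decode_with_table(b"a\x80", table)
```

The point of the string table is that each byte costs one index operation
instead of a dictionary lookup plus an integer-to-character conversion.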

Speed looks like this (mac-roman decoding drops from 3.85 msec to 623 usec
per loop, utf-8 is unchanged):

python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 538 usec per loop
python2.4 -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
100 loops, best of 3: 3.85 msec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('utf-8')"
1000 loops, best of 3: 539 usec per loop
./python-cvs -mtimeit "s='a'*53*1024; u=unicode(s)" "s.decode('mac-roman')"
1000 loops, best of 3: 623 usec per loop

Creating the decoding_map as a string should probably be done by 
gencodec.py directly. This way the first import of the codec would be 
faster too.
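
As a rough idea of what that conversion involves, the sketch below turns a
dict-style decoding_map (byte value -> code point, or None for undefined)
into a 256-character table. The helper name decoding_table_from_map and the
toy map are assumptions for illustration, not what gencodec.py does today,
and the code is written for Python 3.

```python
def decoding_table_from_map(decoding_map):
    """Build a 256-character decoding table from a dict (hypothetical helper).

    Missing keys and None values become U+FFFD ("maps to undefined").
    """
    chars = []
    for byte in range(256):
        codepoint = decoding_map.get(byte)
        chars.append("\ufffd" if codepoint is None else chr(codepoint))
    return "".join(chars)


# Toy map: identity for ASCII, 0x80 -> EURO SIGN, 0x81 explicitly undefined.
decoding_map = {i: i for i in range(128)}
decoding_map[0x80] = 0x20AC
decoding_map[0x81] = None

decoding_table = decoding_table_from_map(decoding_map)
```

If gencodec.py wrote the resulting string literal into the codec module,
this loop would not have to run at import time.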

Bye,
    Walter Dörwald
