PEP 393 (Flexible String Representation) is, without doubt, one of the pearls of Python 3.3. In addition to reducing memory consumption, it often also leads to a corresponding increase in speed. In particular, string encoding is now 1.5-3 times faster.
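
The memory side is easy to observe directly. A minimal sketch (requires a 3.3 build; exact totals vary by platform): sys.getsizeof shows the per-character storage growing from 1 to 2 to 4 bytes as the widest code point in the string grows.

import sys

# With PEP 393 the storage per character is 1, 2 or 4 bytes depending
# on the widest code point in the string (exact sizes vary by build).
for s in ('a' * 1000, '\u0100' * 1000, '\U00010000' * 1000):
    print('U+%04X: %d bytes' % (ord(s[0]), sys.getsizeof(s)))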

But decoding does not fare so well. Here are the results of measuring the decoding performance of 1000-character strings consisting of characters from different Unicode ranges, for three versions of Python -- 2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 builds, gcc 4.4. Times are in microseconds per decoding call.

encoding  string                 2.7   3.2   3.3

ascii     " " * 1000             5.4   5.3   1.2

latin1    " " * 1000             1.8   1.7   1.3
latin1    "\u0080" * 1000        1.7   1.6   1.0

utf-8     " " * 1000             6.7   2.4   2.1
utf-8     "\u0080" * 1000       12.2  11.0  13.0
utf-8     "\u0100" * 1000       12.2  11.1  13.6
utf-8     "\u0800" * 1000       14.7  14.4  17.2
utf-8     "\u8000" * 1000       13.9  13.3  17.1
utf-8     "\U00010000" * 1000   17.3  17.5  21.5

utf-16le  " " * 1000             5.5   2.9   6.5
utf-16le  "\u0080" * 1000        5.5   2.9   7.4
utf-16le  "\u0100" * 1000        5.5   2.9   8.9
utf-16le  "\u0800" * 1000        5.5   2.9   8.9
utf-16le  "\u8000" * 1000        5.5   7.5  21.3
utf-16le  "\U00010000" * 1000    9.6  12.9  30.1

utf-16be  " " * 1000             5.5   3.0   9.0
utf-16be  "\u0080" * 1000        5.5   3.1   9.8
utf-16be  "\u0100" * 1000        5.5   3.1  10.4
utf-16be  "\u0800" * 1000        5.5   3.1  10.4
utf-16be  "\u8000" * 1000        5.5   6.6  21.2
utf-16be  "\U00010000" * 1000    9.6  11.2  28.9

utf-32le  " " * 1000            10.2  10.4  15.1
utf-32le  "\u0080" * 1000       10.0  10.4  16.5
utf-32le  "\u0100" * 1000       10.0  10.4  19.8
utf-32le  "\u0800" * 1000       10.0  10.4  19.8
utf-32le  "\u8000" * 1000       10.1  10.4  19.8
utf-32le  "\U00010000" * 1000   11.7  11.3  20.2

utf-32be  " " * 1000            10.0  11.2  15.0
utf-32be  "\u0080" * 1000       10.1  11.2  16.4
utf-32be  "\u0100" * 1000       10.0  11.2  19.7
utf-32be  "\u0800" * 1000       10.1  11.2  19.7
utf-32be  "\u8000" * 1000       10.1  11.2  19.7
utf-32be  "\U00010000" * 1000   11.7  11.2  20.2

The first oddity is that characters from the second half of the Latin-1 table are decoded faster than characters from the first half (1.0 vs 1.3 usec in 3.3). I think the characters from the first half of the table should be decoded at least as quickly.
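
This case is easy to reproduce in isolation. A minimal sketch (it uses bytes.decode rather than codecs.getdecoder, which should not affect the comparison; run it as a script so the setup import works):

import timeit

# Compare decoding of a character from the first and from the second
# half of the Latin-1 table; 'data' is imported from __main__ at setup.
for ch in (' ', '\u0080'):
    data = (ch * 1000).encode('latin1')
    t = min(timeit.repeat('data.decode("latin1")',
                          'from __main__ import data',
                          repeat=10, number=10000))
    print('U+%04X: %.1f usec' % (ord(ch), t * 1e6 / 10000))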

The second, sadder oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by 1.5-2 times.

The fact that UTF-8 decoding has also slowed in some cases is not surprising. I believe that on a platform with a 64-bit long there may be other oddities.
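
For anyone reproducing these numbers, the size of the C long for a given build can be checked with the standard struct module (a small sketch; struct's native 'l' format mirrors the build's C long):

import struct

# calcsize('l') in native mode reports sizeof(long) for this build.
print('sizeof(long) = %d bytes' % struct.calcsize('l'))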

How serious a problem is this for the Python 3.3 release? I could work on the optimization, if someone is not already doing so. The benchmark scripts (Python 3 and Python 2 versions) follow.
# -*- coding: utf-8 -*-
# Benchmark script, Python 3 version.
import codecs, timeit

def bench_decode(encoding, string):
    # Build the encoded test data; skip combinations the codec
    # cannot encode (e.g. non-ASCII text with the ascii codec).
    try:
        x = eval(string).encode(encoding)
    except UnicodeEncodeError:
        return
    setup = '''
import codecs
d = codecs.getdecoder({encoding!r})
x = {x!r}
'''.format(encoding=encoding, x=x)
    t = timeit.Timer('d(x)', setup)
    repeat = 10
    number = 10000
    # Take the best of 10 runs and report microseconds per decode call.
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print("%-8s  %-20s  %4.1f" % (encoding, string, usec))


for encoding in ('ascii', 'latin1', 'utf-8', 'utf-16le', 'utf-16be', 'utf-32le', 'utf-32be'):
    for string in ('" " * 1000', '"\\u0080" * 1000', '"\\u0100" * 1000', '"\\u0800" * 1000', '"\\u8000" * 1000', '"\\U00010000" * 1000'):
        bench_decode(encoding, string)
    print()

# -*- coding: utf-8 -*-
# Benchmark script, Python 2 version (note the u"" literals and the
# bare print statement).
import codecs, timeit

def bench_decode(encoding, string):
    # Same methodology as the Python 3 script above.
    try:
        x = eval(string).encode(encoding)
    except UnicodeEncodeError:
        return
    setup = '''
import codecs
d = codecs.getdecoder({encoding!r})
x = {x!r}
'''.format(encoding=encoding, x=x)
    t = timeit.Timer('d(x)', setup)
    repeat = 10
    number = 10000
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print("%-8s  %-20s  %4.1f" % (encoding, string, usec))


for encoding in ('ascii', 'latin1', 'utf-8', 'utf-16le', 'utf-16be', 'utf-32le', 'utf-32be'):
    for string in ('u" " * 1000', 'u"\\u0080" * 1000', 'u"\\u0100" * 1000', 'u"\\u0800" * 1000', 'u"\\u8000" * 1000', 'u"\\U00010000" * 1000'):
        bench_decode(encoding, string)
    print