PEP 393 (Flexible String Representation) is, without doubt, one of the pearls of Python 3.3. In addition to reducing memory consumption, it often also leads to a corresponding increase in speed. In particular, string encoding is now 1.5-3 times faster.
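
The memory side is easy to observe directly. A minimal sketch (requires a 3.3 build; exact totals vary by platform): sys.getsizeof shows the per-character storage growing from 1 to 2 to 4 bytes as the widest code point in the string grows.

import sys

# With PEP 393 the storage per character is 1, 2 or 4 bytes depending
# on the widest code point in the string (exact sizes vary by build).
for s in ('a' * 1000, '\u0100' * 1000, '\U00010000' * 1000):
    print('U+%04X: %d bytes' % (ord(s[0]), sys.getsizeof(s)))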

But decoding does not fare so well. Here are the results of measuring the decoding performance of 1000-character strings consisting of characters from different Unicode ranges, for three versions of Python -- 2.7.3rc2, 3.2.3rc2+ and 3.3.0a1+. Little-endian 32-bit i686 builds, gcc 4.4. Times are in microseconds per decoding call.

encoding  string                 2.7   3.2   3.3

ascii     " " * 1000             5.4   5.3   1.2

latin1    " " * 1000             1.8   1.7   1.3
latin1    "\u0080" * 1000        1.7   1.6   1.0

utf-8     " " * 1000             6.7   2.4   2.1
utf-8     "\u0080" * 1000       12.2  11.0  13.0
utf-8     "\u0100" * 1000       12.2  11.1  13.6
utf-8     "\u0800" * 1000       14.7  14.4  17.2
utf-8     "\u8000" * 1000       13.9  13.3  17.1
utf-8     "\U00010000" * 1000   17.3  17.5  21.5

utf-16le  " " * 1000             5.5   2.9   6.5
utf-16le  "\u0080" * 1000        5.5   2.9   7.4
utf-16le  "\u0100" * 1000        5.5   2.9   8.9
utf-16le  "\u0800" * 1000        5.5   2.9   8.9
utf-16le  "\u8000" * 1000        5.5   7.5  21.3
utf-16le  "\U00010000" * 1000    9.6  12.9  30.1

utf-16be  " " * 1000             5.5   3.0   9.0
utf-16be  "\u0080" * 1000        5.5   3.1   9.8
utf-16be  "\u0100" * 1000        5.5   3.1  10.4
utf-16be  "\u0800" * 1000        5.5   3.1  10.4
utf-16be  "\u8000" * 1000        5.5   6.6  21.2
utf-16be  "\U00010000" * 1000    9.6  11.2  28.9

utf-32le  " " * 1000            10.2  10.4  15.1
utf-32le  "\u0080" * 1000       10.0  10.4  16.5
utf-32le  "\u0100" * 1000       10.0  10.4  19.8
utf-32le  "\u0800" * 1000       10.0  10.4  19.8
utf-32le  "\u8000" * 1000       10.1  10.4  19.8
utf-32le  "\U00010000" * 1000   11.7  11.3  20.2

utf-32be  " " * 1000            10.0  11.2  15.0
utf-32be  "\u0080" * 1000       10.1  11.2  16.4
utf-32be  "\u0100" * 1000       10.0  11.2  19.7
utf-32be  "\u0800" * 1000       10.1  11.2  19.7
utf-32be  "\u8000" * 1000       10.1  11.2  19.7
utf-32be  "\U00010000" * 1000   11.7  11.2  20.2

The first oddity is that characters from the second half of the Latin-1 table are decoded faster than characters from the first half (1.0 vs 1.3 usec in 3.3). I think the characters from the first half of the table should be decoded at least as quickly.
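
This case is easy to reproduce in isolation. A minimal sketch (it uses bytes.decode rather than codecs.getdecoder, which should not affect the comparison; run it as a script so the setup import works):

import timeit

# Compare decoding of a character from the first and from the second
# half of the Latin-1 table; 'data' is imported from __main__ at setup.
for ch in (' ', '\u0080'):
    data = (ch * 1000).encode('latin1')
    t = min(timeit.repeat('data.decode("latin1")',
                          'from __main__ import data',
                          repeat=10, number=10000))
    print('U+%04X: %.1f usec' % (ord(ch), t * 1e6 / 10000))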

The second, sadder oddity is that UTF-16 decoding in 3.3 is much slower than even in 2.7. Compared with 3.2, decoding is 2-3 times slower. This is a considerable regression. UTF-32 decoding has also slowed down by 1.5-2 times.

The fact that UTF-8 decoding has also slowed in some cases is not surprising. I believe that on a platform with a 64-bit long there may be other oddities.
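
For anyone reproducing these numbers, the size of the C long for a given build can be checked with the standard struct module (a small sketch; struct's native 'l' format mirrors the build's C long):

import struct

# calcsize('l') in native mode reports sizeof(long) for this build.
print('sizeof(long) = %d bytes' % struct.calcsize('l'))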

How serious a problem is this for the Python 3.3 release? I could work on the optimization, if someone is not already doing so. The benchmark scripts (Python 3 and Python 2 versions) follow.
# -*- coding: utf-8 -*-
# Benchmark script, Python 3 version.
import codecs, timeit

def bench_decode(encoding, string):
    # Build the encoded test data; skip combinations the codec
    # cannot encode (e.g. non-ASCII text with the ascii codec).
    try:
        x = eval(string).encode(encoding)
    except UnicodeEncodeError:
        return
    setup = '''
import codecs
d = codecs.getdecoder({encoding!r})
x = {x!r}
'''.format(encoding=encoding, x=x)
    t = timeit.Timer('d(x)', setup)
    repeat = 10
    number = 10000
    # Take the best of 10 runs and report microseconds per decode call.
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print("%-8s  %-20s  %4.1f" % (encoding, string, usec))


for encoding in ('ascii', 'latin1', 'utf-8', 'utf-16le', 'utf-16be', 'utf-32le', 'utf-32be'):
    for string in ('" " * 1000', '"\\u0080" * 1000', '"\\u0100" * 1000', '"\\u0800" * 1000', '"\\u8000" * 1000', '"\\U00010000" * 1000'):
        bench_decode(encoding, string)
    print()

# -*- coding: utf-8 -*-
# Benchmark script, Python 2 version (note the u"" literals and the
# bare print statement).
import codecs, timeit

def bench_decode(encoding, string):
    # Same methodology as the Python 3 script above.
    try:
        x = eval(string).encode(encoding)
    except UnicodeEncodeError:
        return
    setup = '''
import codecs
d = codecs.getdecoder({encoding!r})
x = {x!r}
'''.format(encoding=encoding, x=x)
    t = timeit.Timer('d(x)', setup)
    repeat = 10
    number = 10000
    r = t.repeat(repeat, number)
    best = min(r)
    usec = best * 1e6 / number
    print("%-8s  %-20s  %4.1f" % (encoding, string, usec))


for encoding in ('ascii', 'latin1', 'utf-8', 'utf-16le', 'utf-16be', 'utf-32le', 'utf-32be'):
    for string in ('u" " * 1000', 'u"\\u0080" * 1000', 'u"\\u0100" * 1000', 'u"\\u0800" * 1000', 'u"\\u8000" * 1000', 'u"\\U00010000" * 1000'):
        bench_decode(encoding, string)
    print