[Python-Dev] Re: _PyBytesWriter/_PyUnicodeWriter could be faster

Victor Stinner Mon, 26 Oct 2020 12:15:43 -0700

Hi,

On Sun, 25 Oct 2020 at 15:36, Ma Lin <malin...@163.com> wrote:
> Some code needs to maintain an output buffer that has an unpredictable size. 
> Such as bz2/lzma/zlib modules, _PyBytesWriter/_PyUnicodeWriter.
>
> In current code, when the output buffer grows, resizing will cause 
> unnecessary memcpy().
>
> issue41486 uses memory blocks to represent output buffer in bz2/lzma/zlib 
> modules, it could eliminate the overhead of resizing.


Some context.

_PyBytesWriter is an internal C API designed for C functions which
return a bytes or a bytearray object and use a loop writing into "ptr"
(pointer into a bytes buffer). Such functions expect a single
contiguous memory block. It is based on realloc() and overallocation
(which can be disabled in the API). It uses a bytes object which is
resized on demand. It also uses a small 512-byte buffer allocated on
the stack for short strings. _PyBytesWriter_Finish() calls
_PyBytes_Resize() if needed.

In 2016, I wrote an article on this API:
https://vstinner.github.io/pybyteswriter.html

realloc() does not always copy memory. Growing a memory block can
sometimes be done in-place (no data copy). The same applies when
_PyBytesWriter_Finish() shrinks a memory block. Also, overallocation
reduces the number of realloc() calls. The _PyBytesWriter design is
optimized for short strings of up to 100 bytes.

--

_PyUnicodeWriter API is designed for the PEP 393 compact string
structure (ASCII, Py_UCS1 latin1, Py_UCS2 and Py_UCS4 formats). It
tries to reduce conversions between the 3 formats (Py_UCS1, Py_UCS2
and Py_UCS4) and also uses overallocation to reduce memory copies.

--

By the way, _PyBytesWriter and _PyUnicodeWriter overallocation is
different on Windows:

#ifdef MS_WINDOWS
   /* On Windows, overallocate by 50% is the best factor */
#  define OVERALLOCATE_FACTOR 2
#else
   /* On Linux, overallocate by 25% is the best factor */
#  define OVERALLOCATE_FACTOR 4
#endif
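
To get a feeling for what these factors mean, here is a small Python
model of the growth policy. It is a sketch, not the C code: it assumes
the buffer grows to needed + needed / OVERALLOCATE_FACTOR on each
resize, as the comments above suggest, and the function names are
invented for the example.

```python
def grow(needed, factor):
    """Return the new capacity for `needed` bytes.

    factor=2 overallocates by 50%, factor=4 by 25%,
    factor=0 disables overallocation (assumed convention).
    """
    if factor:
        return needed + needed // factor
    return needed

def count_resizes(total, chunk, factor):
    """Count resizes needed to write `total` bytes in `chunk`-sized writes."""
    capacity = 0
    used = 0
    resizes = 0
    while used < total:
        needed = used + chunk
        if needed > capacity:
            # The buffer is too small: resize it (a realloc() in C).
            capacity = grow(needed, factor)
            resizes += 1
        used = needed
    return resizes

if __name__ == "__main__":
    for factor in (0, 4, 2):
        print("factor", factor, "->", count_resizes(1000, 10, factor), "resizes")
```

The model only counts resizes. It says nothing about whether a given
realloc() moves the data, which depends on the memory allocator.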

--

The internal C API _PyAccu is a variant of _PyUnicodeWriter which uses
a list of short strings and sometimes concatenates these strings into
a single large string.


> _PyBytesWriter/_PyUnicodeWriter could use the same way.
>
> If write a "general blocks output buffer", it could be used in 
> _PyBytesWriter/bz2/lzma/zlib. (issue41486 is not very general, it uses a 
> bytes object to represent a memory block.)

I understand that the main idea is to not use a single buffer, but use
a list of buffers, and concatenate them in
_BlocksOutputBuffer_Finish(). This is similar to the _PyAccu API.
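
As I understand it, the idea looks roughly like this when modeled in
pure Python. The class and method names are hypothetical and do not
match the C code of issue41486. It is only meant to show the shape of
the approach: fill fixed-size blocks, never resize a full block, and
copy everything once at the end.

```python
class BlocksBuffer:
    """Toy model of an output buffer made of a list of blocks."""

    def __init__(self, block_size=32 * 1024):
        self.block_size = block_size
        self.blocks = [bytearray()]

    def write(self, data):
        view = memoryview(data)
        while len(view):
            last = self.blocks[-1]
            room = self.block_size - len(last)
            if room == 0:
                # Current block is full: start a new one instead of
                # growing (and possibly moving) the existing data.
                last = bytearray()
                self.blocks.append(last)
                room = self.block_size
            last += view[:room]
            view = view[room:]

    def finish(self):
        # Single final concatenation into one contiguous bytes object.
        return b"".join(self.blocks)

if __name__ == "__main__":
    buf = BlocksBuffer(block_size=4)
    buf.write(b"hello")
    buf.write(b" world")
    print(len(buf.blocks), buf.finish())
```

Note that finish() still copies all the data once, so the gain depends
on how many copies the realloc()-based approach makes in practice.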

Maybe some functions using _PyBytesWriter can be adapted to use a list
of buffers rather than a single buffer. But I'm not convinced that it
would make them faster. The question is which kind of functions you
want to optimize, for which string length, etc. You should dig into
the old issues where I optimized str%args and str.format():

* http://bugs.python.org/issue14687 : str % args
* http://bugs.python.org/issue14744 : str.format()
* https://bugs.python.org/issue2534 : bytes % args

I used benchmarks like:

https://github.com/vstinner/pymicrobench/blob/master/bench_bytes_format_int.py
https://github.com/vstinner/pymicrobench/blob/master/bench_str_format.py
https://github.com/vstinner/pymicrobench/blob/master/bench_str_format_keywords.py
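
A minimal sketch of this kind of micro-benchmark, using only the
stdlib timeit module rather than the tooling of the scripts above. The
cases, string lengths and helper names are chosen for illustration
only.

```python
import timeit

# name -> (statement, setup); short strings on purpose.
CASES = {
    "str % args": ("'%s%s' % (a, b)", "a = 'x' * 10; b = 'y' * 10"),
    "str.format": ("'{}{}'.format(a, b)", "a = 'x' * 10; b = 'y' * 10"),
    "bytes % args": ("b'%d %d' % (a, b)", "a = 123; b = 456789"),
}

def bench(stmt, setup="pass", number=100_000, repeat=5):
    """Return the best time per call, in nanoseconds."""
    timings = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(timings) / number * 1e9

def run(cases=CASES, **kwargs):
    """Run every case and return a dict name -> nanoseconds per call."""
    return {name: bench(stmt, setup, **kwargs)
            for name, (stmt, setup) in cases.items()}

if __name__ == "__main__":
    for name, ns in run().items():
        print(f"{name}: {ns:.1f} ns")
```

Taking the minimum over several runs reduces noise, but results on
such short operations remain sensitive to the machine state, so they
should be compared on the same system and repeated.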


> If write a new _PyUnicodeWriter like this, it has a chance to eliminate the 
> overhead of switching PyUnicode_Kind (record the switching position):
>
>     'a' * 100_000_000 + '\uABCD'

For a+b, Python first computes "a", then "b", and finally "a+b". I
don't see how your API could optimize such code.

For operations on strings like "%s%s" % (a, b) or "{}{}".format(a, b),
Python internally uses _PyUnicodeWriter. To format "a",
_PyUnicodeWriter just stores a reference to it as
_PyUnicodeWriter.buffer and marks the buffer as read-only
(optimization when the result is made of a single string: no copy is
made at all!). To format "b", _PyUnicodeWriter_WriteStr() converts the
buffer to Py_UCS2 and then writes the new string.

The "a" string is only written "once", not twice. I don't see how your
API would avoid copies in such cases.

Moreover, str % args and str.format() are optimized to avoid
over-allocation when "b" is written: the final
_PyUnicodeWriter_Finish() call is free, it does nothing.


> If anyone has time and is willing to try, it's very welcome.
> Or I might do this at sometime in the future.

I may be completely wrong. Please try it and show benchmarks proving
that your approach is faster on specific use cases, without hurting
performance on short strings ;-)

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/O3T6B3HDO24M3W5NZE2RCR7FCZTMAWV3/
Code of Conduct: http://python.org/psf/codeofconduct/
