New submission from syl-nktaylor <syl.nktay...@gmail.com>:
In https://github.com/python/cpython/blob/master/Objects/unicodeobject.c#L12930, unicode_repeat does string multiplication with an integer in 3 different ways: 1) one memset call, for utf-8 when string size is 1 2) linear 'for' loops, for utf-16 and utf-32 when string size is 1 3) logarithmic 'while' loop with memcpy calls, for utf-8/utf-16/utf-32 when string size > 1 Is there a performance or correctness reason for which we can't also use the 3rd way for the second case? I realise depending on architecture, the second case could benefit from vectorization, but the memcpy calls will also be hardware optimised. An example of using just the 1st and 3rd methods: --- a/Objects/unicodeobject.c +++ b/Objects/unicodeobject.c @@ -12954,31 +12954,16 @@ unicode_repeat(PyObject *str, Py_ssize_t len) assert(PyUnicode_KIND(u) == PyUnicode_KIND(str)); - if (PyUnicode_GET_LENGTH(str) == 1) { - int kind = PyUnicode_KIND(str); - Py_UCS4 fill_char = PyUnicode_READ(kind, PyUnicode_DATA(str), 0); - if (kind == PyUnicode_1BYTE_KIND) { - void *to = PyUnicode_DATA(u); - memset(to, (unsigned char)fill_char, len); - } - else if (kind == PyUnicode_2BYTE_KIND) { - Py_UCS2 *ucs2 = PyUnicode_2BYTE_DATA(u); - for (n = 0; n < len; ++n) - ucs2[n] = fill_char; - } else { - Py_UCS4 *ucs4 = PyUnicode_4BYTE_DATA(u); - assert(kind == PyUnicode_4BYTE_KIND); - for (n = 0; n < len; ++n) - ucs4[n] = fill_char; - } - } - else { + Py_ssize_t char_size = PyUnicode_KIND(str); + char *to = (char *) PyUnicode_DATA(u); + if (PyUnicode_GET_LENGTH(str) == 1 && char_size == PyUnicode_1BYTE_KIND) { + Py_UCS4 fill_char = PyUnicode_READ(char_size, PyUnicode_DATA(str), 0); + memset(to, fill_char, len); + } else { /* number of characters copied this far */ Py_ssize_t done = PyUnicode_GET_LENGTH(str); - Py_ssize_t char_size = PyUnicode_KIND(str); - char *to = (char *) PyUnicode_DATA(u); memcpy(to, PyUnicode_DATA(str), PyUnicode_GET_LENGTH(str) * char_size); while (done < nchars) {... ---------- components: Unicode messages: 382681 nosy: ezio.melotti, syl-nktaylor, vstinner priority: normal severity: normal status: open title: Consistency in unicode string multiplication with an integer type: behavior versions: Python 3.10 _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue42593> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com