New submission from jakirkham <jakirk...@gmail.com>:

When working with lower-level C/C++ code, the Python Buffer Protocol[1] has 
been immensely useful, as it allows common Python `bytes`-like objects to expose 
their underlying memory buffer as a pointer that C/C++ code can easily work with 
zero-copy. In fact, `memoryview` objects can be quite handy for coercing Python 
objects that support the Buffer Protocol into something that Python and/or 
C/C++ code can use easily. This works with several Python objects and many 
Python APIs, and is relied on heavily by many performance-conscious 3rd party 
libraries.
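For illustration, a small sketch of what the buffer protocol already gives us for `bytes`-like objects: views are zero-copy, slicing a view copies no data, and writes through a view of a mutable buffer land in the original object.

```python
# A memoryview over a bytes-like object is zero-copy: slicing it
# copies nothing, and writes through a view of a mutable buffer
# modify the original object in place.
data = bytearray(b"hello world")
mv = memoryview(data)

mv[0:5] = b"HELLO"   # writes through to `data`
sub = mv[6:]         # a sub-view; no bytes are copied here

print(data)          # bytearray(b'HELLO world')
print(bytes(sub))    # b'world'
```

This is exactly the kind of access that is currently unavailable for `str`.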

However, one heavily used Python object that doesn't support this API is `str` 
(previously `unicode`; see the code below).

```python
In [1]: s = "Hello World!"                                                      

In [2]: mv = memoryview(s)                                                      
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-3403c1ca3811> in <module>
----> 1 mv = memoryview(s)

TypeError: memoryview: a bytes-like object is required, not 'str'
```

The canonical answer today is [to encode to `bytes` first]( 
https://stackoverflow.com/a/54449407 ) and decode back to `str` later. While 
this is fine for a smallish piece of text, it slows down considerably for 
larger pieces. So being able to skip this encode/decode step can be quite 
impactful.
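For concreteness, the workaround in full looks like this. Only the `memoryview` step in the middle is zero-copy; the conversions on either side each copy the full text, which is what the timings below are measuring.

```python
s = 100 * "Hello World!"

b = s.encode("utf-8")      # copy #1: str -> bytes
mv = memoryview(b)         # zero-copy: C/C++ code could consume this
#  ... hand `mv` off via the buffer protocol ...
out = mv.tobytes().decode("utf-8")  # copy #2: bytes -> str

assert out == s
```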

```python
In [1]: s = "Hello World!"                                                      

In [2]: %timeit s.encode();                                                     
54.9 ns ± 0.0788 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: s = 100_000_000 * "Hello World!"                                        

In [4]: %timeit s.encode();                                                     
729 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

AIUI (though I could be misunderstanding things), CPython `str` objects already 
store their characters in a contiguous typed array — since PEP 393, one whose 
element width is 1, 2, or 4 bytes depending on the widest code point in the 
string. So it seems like it *should* be possible to expose this as a 1-D 
contiguous array that C/C++ code could use. Though I may be misunderstanding 
how `str`s actually work under the hood (if so, apologies).
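One indirect way to see that per-character width from pure Python is `sys.getsizeof` — a rough sketch, assuming CPython's PEP 393 layout (the exact header sizes vary, but the growth per character shows the element width):

```python
import sys

# Under CPython's PEP 393 "flexible" representation, the per-character
# storage width is 1, 2, or 4 bytes depending on the widest code point.
# Doubling a string's length should grow its size by length * width bytes.
n = 1000
ascii_s = "a" * n            # fits in 1 byte per character
bmp_s = "\u0100" * n         # needs 2 bytes per character
astral_s = "\U00010000" * n  # needs 4 bytes per character

print(sys.getsizeof(ascii_s * 2) - sys.getsizeof(ascii_s))    # 1000
print(sys.getsizeof(bmp_s * 2) - sys.getsizeof(bmp_s))        # 2000
print(sys.getsizeof(astral_s * 2) - sys.getsizeof(astral_s))  # 4000
```

So the contiguous array is there; it just isn't reachable through the buffer protocol.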

It would be quite helpful to bypass this encoding/decoding step and instead 
work directly with the underlying buffer in situations where C/C++ is 
involved, to help performance-critical code.

[1]: https://docs.python.org/3/c-api/buffer.html

----------
components: Library (Lib)
messages: 374147
nosy: jakirkham
priority: normal
severity: normal
status: open
title: memoryview of str (unicode)
type: enhancement
versions: Python 3.10, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41377>
_______________________________________