[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

Andrew Barnert via Python-ideas Mon, 30 Mar 2020 14:03:22 -0700

On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmis...@gmail.com> wrote:
> 
> I appreciate expressing it all concisely and clearly. Then let me
> respond here instead of the very first '"".join() rules!' reply I got.


Ignoring replies doesn’t actually answer them.

> The issue with "".join() is very obvious:
> 
> ------
> import io
> import sys
> 
> 
> def strio():
>    sb = io.StringIO()
>    for i in range(50000):
>        sb.write(u"==%d==" % i)
>    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))

This doesn’t tell you anything useful. As the help for getsizeof makes clear, 
“Only the memory consumption directly attributed to the object is accounted 
for, not the memory consumption of objects it refers to”. So this gives you 
some fixed value like 152, no matter how big the buffer and other internal 
objects may be.
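
To make that concrete, here’s a rough sketch of measuring this with tracemalloc instead. The helper names peak_traced_bytes and build_with_stringio are mine, made up for illustration, and the exact numbers will vary by Python version and platform:

```python
import io
import sys
import tracemalloc


def peak_traced_bytes(build):
    """Run build() and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = build()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_with_stringio(n=50000):
    """Same loop as the quoted example, returning the final string."""
    sb = io.StringIO()
    for i in range(n):
        sb.write("==%d==" % i)
    return sb.getvalue()


# getsizeof is shallow: a list holding a ~1 MB string reports only the
# list object itself, not the string it refers to.
big = ["x" * 1_000_000]
shallow = sys.getsizeof(big)

text, peak = peak_traced_bytes(build_with_stringio)
print(shallow, len(text), peak)
```

The peak figure is what matters for this argument; passing a join-based builder to the same helper gives a like-for-like comparison.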

If you’re using CPython with the C accelerator, none of those internal objects 
are exposed through the API, but a quick scan of the C source shows what’s 
there, and it generally takes more storage than the list version. 
Oversimplifying a bit: While you’re building, it keeps a _PyAccu structure, 
which is basically a wrapper around that same list of strings. When you call 
getvalue() it then builds a Py_UCS4* representation that’s in this case 4x the 
size of the final string (since your string is pure ASCII and will be stored in 
UCS1, not UCS4). And then there’s the final string.

So, if this memory issue makes join unacceptable, it makes your optimization 
even more unacceptable.

And thinking about portable code makes it even worse. Your code might be run 
under CPython and take even more memory, or it might be run under a different 
Python implementation where StringIO is not accelerated (where it’s just a 
TextIOWrapper around a BytesIO) and therefore be a whole lot slower instead. So 
it has to be able to deal with both of those possibilities, not just one; code 
that uses the usual idiom, on the other hand, behaves pretty similarly on all 
implementations.
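
For reference, the usual idiom I mean is just this (a minimal sketch; the function name build_with_join is only for illustration):

```python
def build_with_join(n=50000):
    """Accumulate the pieces in a list, then join them once at the end."""
    parts = []
    for i in range(n):
        parts.append("==%d==" % i)
    return "".join(parts)


text = build_with_join(3)
print(text)
```

It relies on nothing but list.append and str.join, which every implementation has to provide.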

> There's absolutely no need why performing trivial operation of
> accumulating string content should take about order of magnitude more
> memory than actually needed for that string content. Don't get me wrong
> - if you want to spend that much of your memory, then sure, you can. But
> jumping with that as *the only right solution* whenever somebody
> mentions "string concatenation" is a bit ... umm, cavalier

And making a wild guess about how things might be implemented, offering an 
optimization based on that guess that actually makes things worse, and refusing 
to even reply when people point out the problems isn’t even more cavalier?

> My whole concern is along 2 lines:
> 
> 1. This StringBuilder class *could* be an existing io.StringIO.
> 2. By just adding __iadd__ operator to it.

No, it really couldn’t. The semantics are wrong (unless you want, say, 
universal newline handling in your string builder?), it’s optimized for a 
different use case than string building, and both the pure-Python and CPython 
accelerator implementations are less efficient in speed and/or memory.
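
A small sketch of what I mean by file semantics, using only documented io.StringIO behavior (the variable names are arbitrary):

```python
import io

# Writes land at the current file position, so they can overwrite
# existing content instead of appending to it.
f = io.StringIO("hello")  # the position starts at 0
f.write("J")
overwritten = f.getvalue()

# The newline parameter is part of the file API. With newline=None,
# written text goes through universal-newline translation.
g = io.StringIO(newline=None)
g.write("a\r\nb")
translated = g.getvalue()

# The default (newline="\n") leaves the text alone.
h = io.StringIO()
h.write("a\r\nb")
untouched = h.getvalue()

print(repr(overwritten), repr(translated), repr(untouched))
```

None of that is wrong for a file object, but a += on a string builder would have to decide what to do about every one of those knobs.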

> That's it, nothing else. What's inside StringIO class is up to you (dear
> various Python implementations, their maintainers, and contributors).

Sure, but what’s inside has to actually perform the job it was designed to do 
and is documented to do: to simulate a file object in memory. Which is not the 
same thing as being a string builder.

> For example, fans of "".join() surely can have it inside. Actually,
> it's a known fact that Python2's "StringIO" module (the original home
> of StringIO class) was implemented exactly like that, so you can go
> straight back to the future.

Python2’s StringIO module is for bytes, not Unicode strings. If you want a 
mutable bytes-like type, bytearray already exists; there’s no need to wrap the 
sequence up in a file-like API just to rewrap that in a sequence-like API 
again; just use the sequence directly. What StringIO is there for is when you 
_need_ the file API, just as in Python 3’s io.BytesIO. It’s not a more 
efficient bytearray or one better suited for string building; it’s less 
efficient and less well suited for string building but it adds different 
features.
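
For the bytes case, using the sequence directly looks like this (a minimal sketch with arbitrary names):

```python
buf = bytearray()
for i in range(3):
    # In-place extend on the mutable sequence; no file API involved.
    buf += b"==%d==" % i

data = bytes(buf)
print(data)
```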

> And again, the need for anything like that might be unclear for
> CPython-only users. Such users can write a StringBuilder class like
> above, or repeat the beautiful "".join() trick over and over again. The
> need for a nice string builder class may occur only from the
> consideration of the Python-as-a-language lacking a clear and nice
> abstraction for it, and from thinking how to add such an abstraction in
> a performant way (of which criteria are different) in as many
> implementation as possible, in as easy as possible way. (At least
> that's my path to it, I'm not sure if a different thought process might
> lead to it too.)

The problem isn’t your start, it’s jumping to the assumption that StringIO must 
be an answer, and then not checking the docs and the code to see if there are 
problems, and then ignoring the problems when they’re pointed out. Why do you 
think a virtual file object must be the optimal way to implement a string 
builder in the first place?
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BHKUGCLSFQH3HRVPVOO7XJZYSBXRZXP2/
Code of Conduct: http://python.org/psf/codeofconduct/
