[Python-ideas] Re: Explicitly defining a string buffer object (aka StringIO += operator)

Paul Sokolovsky Tue, 31 Mar 2020 12:09:38 -0700

Hello,

On Mon, 30 Mar 2020 13:59:42 -0700
Andrew Barnert <abarn...@yahoo.com> wrote:

> On Mar 30, 2020, at 13:06, Paul Sokolovsky <pmis...@gmail.com> wrote:
> > 
> > I appreciate expressing it all concisely and clearly. Then let me
> > respond here instead of the very first '"".join() rules!' reply I
> > got.  
> 
> Ignoring replies doesn’t actually answer them.

I'm happy to discuss various points, but it would be nice to have
discussion focused, given that the change proposed is pretty simple.
I'm not sure if it is my fault for having tried to structure the original
RFC as a poor-man's PEP (so it's somewhat long'ish), but I definitely
would like to avoid discussing extended topics along the lines of
"there're some mundane languages which offer those string builder
classes, but Python is so, SO, special, that it doesn't need it, and
whoever thinks otherwise just doesn't get it" or "building a string
from pieces by putting pointers to pieces into array, and then
concatenating them together is the PEAK achievement of the computer
science, and whoever didn't get that just... just... didn't read
CPython (yes, CPython!) FAQ".

> > The issue with "".join() is very obvious:
> > 
> > ------
> > import io
> > import sys
> > 
> > 
> > def strio():
> >    sb = io.StringIO()
> >    for i in range(50000):
> >        sb.write(u"==%d==" % i)
> >    print(sys.getsizeof(sb) + sys.getsizeof(sb.getvalue()))  
> 
> This doesn’t tell you anything useful. As the help for getsizeof
> makes clear, “Only the memory consumption directly attributed to the
> object is accounted for, not the memory consumption of objects it
> refers to”. So this gives you some fixed value like 152, no matter
> how big the buffer and other internal objects may be.

Yeah, I tried to account for that with "sys.getsizeof(sb) +
sys.getsizeof(sb.getvalue())", thanks for noticing that.
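
One way to sidestep the shallow-size problem on CPython is to measure
peak allocation with the standard tracemalloc module instead of
sys.getsizeof. The following is a minimal sketch, not the benchmark from
the original RFC: the helper names (build_join, build_stringio,
peak_bytes) are made up for illustration, and whatever numbers it prints
are specific to the interpreter and version it runs on.

```python
import io
import tracemalloc

N = 50000  # same iteration count as the quoted example


def build_join():
    # The usual idiom: collect the pieces in a list, join once at the end.
    parts = []
    for i in range(N):
        parts.append("==%d==" % i)
    return "".join(parts)


def build_stringio():
    # The alternative under discussion: write the pieces into io.StringIO.
    sb = io.StringIO()
    for i in range(N):
        sb.write("==%d==" % i)
    return sb.getvalue()


def peak_bytes(func):
    """Call func() and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


joined, join_peak = peak_bytes(build_join)
written, sio_peak = peak_bytes(build_stringio)
print("join peak:    ", join_peak)
print("StringIO peak:", sio_peak)
```

tracemalloc only sees allocations made through Python's own allocators,
and it is a CPython facility, so this says nothing about other
implementations.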

> If you’re using CPython with the C accelerator, none of those things
> are available to you from the API, but a quick scan of the C source
> shows what’s there, and it’s generally actually more storage than the
> list version. Oversimplifying a bit: While you’re building, it keeps
> a _PyAccu structure, which is basically a wrapper around that same
> list of strings. When you call getvalue() it then builds a Py_UCS4*
> representation that’s in this case 4x the size of the final string
> (since your string is pure ASCII and will be stored in UCS1, not
> UCS4). And then there’s the final string.

Thanks very much for this intro into the CPython
io.StringIO implementation, much appreciated. Please let me return the
favor and explain how StringIO is implemented in Pycopy, which I happen
to maintain, and in MicroPython (as the original implementation there
was written by me). So, there's an array of bytes. Both implementations
use utf-8 to store strings. So, StringIO stores as many bytes as there
is actual data in the (utf-8) strings. Of course, there's some
over-allocation policy to avoid severe quadratic behavior on growing.
Overall, storing N bytes of string data requires N + a small % of N
bytes of memory. No additional array of pointers is needed. The original
constituent strings (each over-allocated of course) can be GCed in the
meantime.
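
To make the description above concrete, here is a rough pure-Python
sketch of the idea. It is not the Pycopy or MicroPython code: the class
name Utf8Buffer and its methods are hypothetical, and it leans on
bytearray's own growth policy instead of implementing over-allocation by
hand.

```python
class Utf8Buffer:
    """Hypothetical string accumulator backed by a single byte array."""

    def __init__(self):
        self._buf = bytearray()

    def write(self, s):
        # Append the UTF-8 encoding of s. No reference to s is kept, so
        # the constituent string can be collected as soon as the caller
        # drops it.
        self._buf += s.encode("utf-8")
        return len(s)

    def size_in_bytes(self):
        # Payload bytes stored so far (spare capacity is not counted).
        return len(self._buf)

    def getvalue(self):
        # Decode the accumulated bytes back into a str.
        return self._buf.decode("utf-8")


buf = Utf8Buffer()
for i in range(3):
    buf.write("==%d==" % i)
print(buf.getvalue())  # ==0====1====2==
```

The point of the sketch is only the storage layout: one growing byte
array, with no per-piece list of pointers alongside it.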

The moral is known, and was stated in the original RFC: for as
long as somebody's attention is fixated on CPython, the likely
reply from them would be: "there's no problem with CPython3, so there's
nothing to fix". It takes stepping up and thinking about *multiple*
implementations and the *interface* they *can* offer.

> So, if this memory issue makes join unacceptable, it makes your
> optimization even more unacceptable.
> 
> And thinking about portable code makes it even worse. Your code might
> be run under CPython and take even more memory, or it might be run
> under a different Python implementation where StringIO is not
> accelerated (where it’s just a TextIOWrapper around a BytesIO) and
> therefore be a whole lot slower instead. So it has to be able to deal
> with both of those possibilities, not just one; code that uses the
> usual idiom, on the other hand, behaves pretty similarly on all
> implementations.

Indeed, the usual idiom absolutely and guaranteedly wastes a lot of
memory. (It's also the fastest, no worries.)

> > There's absolutely no need why performing trivial operation of
> > accumulating string content should take about order of magnitude
> > more memory than actually needed for that string content. Don't get
> > me wrong
> > - if you want to spend that much of your memory, then sure, you
> > can. But jumping with that as *the only right solution* whenever
> > somebody mentions "string concatenation" is a bit ... umm,
> > cavalier  
> 
> And making a wild guess about how things might be implemented and
> offering an optimization based on that guess that actually makes
> things worse and refusing to even reply when people point out the
> problems isn’t even more cavalier?

The point I tried to show is that StringIO is never worse than str +=
regarding performance (stats for 8 implementations were demonstrated).
What was left implied is that it can also be very memory-efficient;
thanks to your thorough attention, that has now been made explicit, with
a (very simple and obvious!) way of achieving it described above.

I'm sorry to hear about deficiencies in the StringIO implementation of
your favorite Python implementation. On the positive side, now that
they're identified, they can be fixed (if there's a need to care
about them for that particular implementation).

Likewise, I'm sorry for not showing the fullest possible extent of
appreciation for your joining the discussion of "StringIO vs str +="
matters with claims like "str.join is the fastest!!", and for instead
repeatedly calling to stay on the topic of improving the interface for
string building to be on the same level as the simple and obvious
"str +=". I still tried to answer why str.join can't be a universal
solution for all cases; I'm sorry if I failed to do that.

> > My whole concern is along 2 lines:
> > 
> > 1. This StringBuilder class *could* be an existing io.StringIO.
> > 2. By just adding __iadd__ operator to it.  
> 
> No, it really couldn’t. The semantics are wrong (unless you want,
> say, universal newline handling in your string builder?), it’s
> optimized for a different use case than string building, and both the
> pure-Python and CPython accelerator implementations are less
> efficient in speed and/or memory.

Less efficient than what? I start with "str +=", which is simple and
obvious but vividly inefficient across different Python
implementations. I proceed with proposing how, with a very simple
change, the simplicity and obviousness of "str +=" can be retained,
while runtime efficiency can be dramatically improved (without any
special implied memory use deficiencies).
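
As an illustration only, the change can be approximated today with a
subclass. The proposal itself is to put __iadd__ on io.StringIO
directly; the StringBuilder name below is borrowed from the quoted
discussion, and this sketch makes no claim about how any particular
implementation would do it internally.

```python
import io


class StringBuilder(io.StringIO):
    """io.StringIO plus the one extra operator under discussion."""

    def __iadd__(self, s):
        # Delegate to the existing write() and return self, so that
        # "sb += piece" keeps sb bound to the same buffer object.
        self.write(s)
        return self


sb = StringBuilder()
for i in range(3):
    sb += "==%d==" % i
result = sb.getvalue()
print(result)  # ==0====1====2==
```

The loop body reads exactly like the "str +=" version; only the
initialisation and the final getvalue() call differ.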

You keep pushing that "there's a faster way to do it". Yes, you're
right - there is. But my proposal was never about "fastest string concat
in the west", or it would have been about rewriting some code in
assembler.

> > That's it, nothing else. What's inside StringIO class is up to you
> > (dear various Python implementations, their maintainers, and
> > contributors).  
> 
> Sure, but what’s inside has to actually perform the job it was
> designed to do and is documented to do: to simulate a file object in
> memory. Which is not the same thing as being a string builder.

Once somebody tries to implement a dedicated "string builder", they
will find that it's some 80% similar to "simulate a file object in
memory". On average. I'm sorry to hear about outlier implementations
where (per your words) the similarity is less than that.

> > For example, fans of "".join() surely can have it inside. Actually,
> > it's a known fact that Python2's "StringIO" module (the original
> > home of StringIO class) was implemented exactly like that, so you
> > can go straight back to the future.  
> 
> Python2’s StringIO module is for bytes, not Unicode strings.

It just occurred to me: maybe I chose the wrong class for running this
discussion, maybe that should have been BytesIO, and you'd be half won
over by now? ;-)

> If you
> want a mutable bytes-like type, bytearray already exists; there’s no
> need to wrap the sequence up in a file-like API just to rewrap that
> in a sequence-like API again;

I humbly disagree. And the motivation is exactly parallel to that of
str vs io.StringIO. For a (binary) string builder, you constantly need
to grow its internal buffer. You also need to do the same for
"simulating a file in memory". Then once you have an object which does
that (hopefully efficiently, again "ah" to those which don't), you don't
need to complicate the implementation of other objects to optimize for
the "growing" case. Just use an object suitable for a particular
use case: bytearray for in-place updates, and BytesIO for
growing-construction.
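
A small sketch of that division of labour, using only the standard
io.BytesIO and bytearray types (the variable names are arbitrary):

```python
import io

# Growing construction: append pieces to a BytesIO, take the result once.
out = io.BytesIO()
for i in range(3):
    out.write(b"==%d==" % i)
data = out.getvalue()  # immutable bytes snapshot of everything written

# In-place update: a bytearray of known size, modified by slice assignment.
patch = bytearray(data)
patch[0:2] = b"##"

print(data)          # b'==0====1====2=='
print(bytes(patch))  # b'##0====1====2=='
```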

I'm sorry in advance if the FAQ for your Python implementation doesn't
provide such suggestions. FAQs for other Python implementations very
well may.

> just use the sequence directly. What
> StringIO is there for is when you _need_ the file API, just as in
> Python 3’s io.BytesIO. It’s not a more efficient bytearray or one
> better suited for string building; it’s less efficient and less well
> suited for string building but it adds different features.
> 
> > And again, the need for anything like that might be unclear for
> > CPython-only users. Such users can write a StringBuilder class like
> > above, or repeat the beautiful "".join() trick over and over again.
> > The need for a nice string builder class may occur only from the
> > consideration of the Python-as-a-language lacking a clear and nice
> > abstraction for it, and from thinking how to add such an
> > abstraction in a performant way (of which criteria are different)
> > in as many implementation as possible, in as easy as possible way.
> > (At least that's my path to it, I'm not sure if a different thought
> > process might lead to it too.)  
> 
> The problem isn’t your start, it’s jumping to the assumption that
> StringIO must be an answer, and then not checking the docs and the

Wrong claim. I just suggest that it *can* be an answer.

> code to see if there are problems, and then ignoring the problems
> when they’re pointed out. Why do you think a virtual file object must
> be the optimal way to implement a string builder in the first place?

Wrong claim: I don't say "optimal" (after all, you suggested that
there's a faster way, and in some cases that can be "optimal"). I
would say a "good compromise".

-- 
Best regards,
 Paul                          mailto:pmis...@gmail.com
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/L6Z2DXAPDIZVGHZ6EJ5J4PT6APB32EQX/