[Python-ideas] Re: Add a line_offsets() method to str

Lucas Wiman Sat, 18 Jun 2022 12:12:42 -0700

I'm a little confused by the benchmark. Using re looks pretty competitive
in terms of speed, and should be much more memory efficient.


# https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7mb; ~170K lines)
with open('/tmp/shakespeare.txt', 'r') as f:
    text = f.read()
import re
from itertools import *
line_re = re.compile(r"\n")

Then when I run it:
In [25]: %timeit _ = list(accumulate(chain([0], map(len,
text.splitlines(True)))))
30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)]
29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is on 3.10.3 on an Intel 2.3gz i9 Macbook. (Note that the regex is
off-by-one from the splitlines implementation.)

What benchmark shows the regex to be significantly slower?

That said, str.indexes(char) sounds like a reasonable addition.

Best wishes,
Lucas Wiman

On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <[email protected]>
wrote:

> Hi everyone,
>
> Today was the 3rd time I came across a situation where it was needed to
> retrieve all the positions of the line endings (or beginnings) in a very
> long python string as efficiently as possible. First time, it was needed in
> prompt_toolkit, where I spent a crazy amount of time looking for the most
> performant solution. Second time was in a commercial project where
> performance was very critical too. Third time is for the Rich/Textual
> project from Will McGugan. (See:
> https://twitter.com/willmcgugan/status/1537782771137011715 )
>
> The problem is that the `str` type doesn't expose any API to efficiently
> find all \n positions. Every Python implementation is either calling
> `.index()` in a loop and collecting the results or running a regex over the
> string and collecting all positions.
>
> For long strings, depending on the implementation, this results in a lot
> of overhead due to either:
> - calling Python functions (or any other Python instruction) for every \n
> character in the input. The amount of executed Python instructions is O(n)
> here.
> - Copying string data into new strings.
>
> The fastest solution I've been using for some time, does this (simplified):
> `accumulate(chain([0], map(len, text.splitlines(True))))`. The
> performance is great here, because the amount of Python instructions is
> O(1). Everything is chained in C-code thanks to itertools. Because of that,
> it can outperform the regex solution with a factor of ~2.5. (Regex isn't
> slow, but iterating over the results is.)
>
> The bad things about this solution is however:
> - Very cumbersome syntax.
> - We call `splitlines()` which internally allocates a huge amount of
> strings, only to use their lengths. That is still much more overhead then a
> simple for-loop in C would be.
>
> Performance matters here, because for these kind of problems, the list of
> integers that gets produced is typically used as an index to quickly find
> character offsets in the original string, depending on which line is
> displayed/processed. The bisect library helps too to quickly convert any
> index position of that string into a line number. The point is, that for
> big inputs, the amount of Python instructions executed is not O(n), but
> O(1). Of course, some of the C code remains O(n).
>
> So, my ask here.
> Would it make sense to add a `line_offsets()` method to `str`?
> Or even `character_offsets(character)` if we want to do that for any
> character?
> Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary
> lengths?
>
> Thanks,
> Jonathan
>
>
>
>
>
>
> _______________________________________________
> Python-ideas mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/[email protected]/message/6WAMKYXOYA3SKL5HIRZP4WARMYYKXI3Q/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/JZP5L57KORNOHILH7W3WZMCWRHGKTPQK/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Add a line_offsets() method to str

Reply via email to