[Python-ideas] Re: Add a line_offsets() method to str

Jonathan Slenders Sat, 18 Jun 2022 13:58:54 -0700

Good catch! One correction here: I somewhat mixed up the benchmarks. I
forgot that both of my projects required support for universal line endings,
which splitlines() handles out of the box. That requires a more
complex regex pattern. I was actually using:
re.compile(r"\n|\r(?!\n)")
With that pattern, the regex becomes significantly slower than the
splitlines() solution, which is itself still much slower than it has to be.
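
To make the comparison concrete, here is a minimal sketch of the two
approaches side by side. The function names and the sample string are made up
for illustration; only the regex pattern and the accumulate/chain expression
come from this thread. Note that splitlines() also splits on a few other
boundaries (for example \v, \f and \u2028) that this pattern ignores, so the
two only agree on text that uses \n, \r and \r\n.

```python
import re
from itertools import accumulate, chain

# The pattern quoted above: "\n", or a "\r" that is not followed by "\n".
LINE_END_RE = re.compile(r"\n|\r(?!\n)")


def starts_via_regex(text):
    # Hypothetical helper: a line starts at 0 and right after every match.
    return [0] + [m.end() for m in LINE_END_RE.finditer(text)]


def starts_via_splitlines(text):
    # Hypothetical helper: running total of line lengths (line endings
    # kept), seeded with 0. The loop itself runs inside itertools.
    return list(accumulate(chain([0], map(len, text.splitlines(True)))))


sample = "one\ntwo\r\nthree\rfour\n"
regex_starts = starts_via_regex(sample)
splitlines_starts = starts_via_splitlines(sample)
print(regex_starts, splitlines_starts)
```

The two lists are identical here because the sample ends with a line ending.
For text without a trailing line ending, the accumulate form additionally
yields len(text) as its last value.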


This makes me realize that `str.indexes(char)` is actually not what I need.
What I really need is a `str.line_offsets()` that returns exactly the
positions that `str.splitlines()` would use. Does that make sense?
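
As a rough illustration of the intended semantics, a pure-Python stand-in
could look like the sketch below. `line_offsets` here is a hypothetical free
function, not an existing `str` method, and it assumes that "positions" means
the start offset of every line that `splitlines()` reports; the proposal does
not pin down that detail.

```python
from itertools import accumulate


def line_offsets(text):
    """Hypothetical stand-in for the proposed str.line_offsets()."""
    lengths = map(len, text.splitlines(keepends=True))
    # Running totals give the end of each line. Seeding with 0 and dropping
    # the final total shifts that to the start of each line.
    return list(accumulate(lengths, initial=0))[:-1]


text = "alpha\nbeta\r\ngamma"
offsets = line_offsets(text)

# Each offset points at the first character of the corresponding line.
for start, line in zip(offsets, text.splitlines()):
    assert text[start:start + len(line)] == line
```

This still allocates one string per line, which is exactly the overhead a
C-level implementation would avoid.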

If this is reasonable, I wouldn't mind working on the implementation.

(@Christophe: In Python, a single string as a data structure is often much
easier to deal with and overall extremely performant. Try searching over a
list of lines.)
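
A small sketch of what that looks like in practice: search the single string
once, then map each match position back to a line number with bisect. The
helper names and the sample text are assumptions made for this example.

```python
import re
from bisect import bisect_right
from itertools import accumulate


def line_starts(text):
    # Hypothetical helper: start offset of every line, as seen by splitlines().
    lengths = map(len, text.splitlines(keepends=True))
    return list(accumulate(lengths, initial=0))[:-1]


def find_with_line_numbers(text, pattern):
    # Fall back to a single line starting at 0 for empty text.
    starts = line_starts(text) or [0]
    results = []
    for m in re.finditer(pattern, text):
        # Index of the last line start that is <= the match position.
        line_no = bisect_right(starts, m.start()) - 1
        column = m.start() - starts[line_no]
        results.append((line_no, column, m.group()))
    return results


text = "spam eggs\nham\nmore eggs here\n"
hits = find_with_line_numbers(text, r"eggs")
print(hits)
```

Line numbers and columns are 0-based in this sketch. The offset list is built
once and each lookup is a binary search, so patterns that span several lines
need no special handling.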

Thanks,
Jonathan




On Sat, Jun 18, 2022 at 21:09, Lucas Wiman <lucas.wi...@gmail.com> wrote:

> I'm a little confused by the benchmark. Using re looks pretty competitive
> in terms of speed, and should be much more memory efficient.
>
> # https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7 MB; ~170K lines)
> import re
> from itertools import accumulate, chain
>
> with open('/tmp/shakespeare.txt', 'r') as f:
>     text = f.read()
> line_re = re.compile(r"\n")
>
> Then when I run it:
> In [25]: %timeit _ = list(accumulate(chain([0], map(len,
> text.splitlines(True)))))
> 30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)]
> 29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>
> This is on Python 3.10.3 on a 2.3 GHz Intel i9 MacBook. (Note that the
> regex is off-by-one from the splitlines implementation.)
>
> What benchmark shows the regex to be significantly slower?
>
> That said, str.indexes(char) sounds like a reasonable addition.
>
> Best wishes,
> Lucas Wiman
>
> On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <jonat...@slenders.be>
> wrote:
>
>> Hi everyone,
>>
>> Today was the third time I came across a situation where I needed to
>> retrieve all the positions of the line endings (or beginnings) in a very
>> long Python string as efficiently as possible. The first time was in
>> prompt_toolkit, where I spent a crazy amount of time looking for the most
>> performant solution. The second time was in a commercial project where
>> performance was critical too. The third time is for the Rich/Textual
>> project from Will McGugan. (See:
>> https://twitter.com/willmcgugan/status/1537782771137011715 )
>>
>> The problem is that the `str` type doesn't expose any API to efficiently
>> find all \n positions. Every pure-Python implementation either calls
>> `.index()` in a loop and collects the results, or runs a regex over the
>> string and collects all positions.
>>
>> For long strings, depending on the implementation, this results in a lot
>> of overhead due to either:
>> - Calling Python functions (or executing any other Python instruction) for
>> every \n character in the input. The number of executed Python
>> instructions is O(n) here.
>> - Copying string data into new strings.
>>
>> The fastest solution, which I've been using for some time, does this
>> (simplified): `accumulate(chain([0], map(len, text.splitlines(True))))`.
>> The performance is great here, because the number of Python instructions is
>> O(1). Everything is chained in C code thanks to itertools. Because of that,
>> it can outperform the regex solution by a factor of ~2.5. (Regex isn't
>> slow, but iterating over the results is.)
>>
>> The downsides of this solution, however, are:
>> - Very cumbersome syntax.
>> - We call `splitlines()`, which internally allocates a huge number of
>> strings, only to use their lengths. That is still much more overhead than a
>> simple for-loop in C would be.
>>
>> Performance matters here, because for these kinds of problems, the list of
>> integers that gets produced is typically used as an index to quickly find
>> character offsets in the original string, depending on which line is
>> displayed/processed. The bisect module also helps to quickly convert any
>> index position in that string into a line number. The point is that, for
>> big inputs, the number of Python instructions executed is not O(n), but
>> O(1). Of course, some of the C code remains O(n).
>>
>> So, here is my ask:
>> Would it make sense to add a `line_offsets()` method to `str`?
>> Or even `character_offsets(character)` if we want to do that for any
>> character?
>> Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary
>> lengths?
>>
>> Thanks,
>> Jonathan
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> Python-ideas mailing list -- python-ideas@python.org
>> To unsubscribe send an email to python-ideas-le...@python.org
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/6WAMKYXOYA3SKL5HIRZP4WARMYYKXI3Q/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/2V65TPPA237YMZNLL2TWKO34XVZPXYJR/
Code of Conduct: http://python.org/psf/codeofconduct/
