I'm a little confused by the benchmark. Using re looks pretty competitive in terms of speed, and should be much more memory efficient.
# https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7mb; ~170K lines) with open('/tmp/shakespeare.txt', 'r') as f: text = f.read() import re from itertools import * line_re = re.compile(r"\n") Then when I run it: In [25]: %timeit _ = list(accumulate(chain([0], map(len, text.splitlines(True))))) 30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)] 29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) This is on 3.10.3 on an Intel 2.3gz i9 Macbook. (Note that the regex is off-by-one from the splitlines implementation.) What benchmark shows the regex to be significantly slower? That said, str.indexes(char) sounds like a reasonable addition. Best wishes, Lucas Wiman On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <jonat...@slenders.be> wrote: > Hi everyone, > > Today was the 3rd time I came across a situation where it was needed to > retrieve all the positions of the line endings (or beginnings) in a very > long python string as efficiently as possible. First time, it was needed in > prompt_toolkit, where I spent a crazy amount of time looking for the most > performant solution. Second time was in a commercial project where > performance was very critical too. Third time is for the Rich/Textual > project from Will McGugan. (See: > https://twitter.com/willmcgugan/status/1537782771137011715 ) > > The problem is that the `str` type doesn't expose any API to efficiently > find all \n positions. Every Python implementation is either calling > `.index()` in a loop and collecting the results or running a regex over the > string and collecting all positions. > > For long strings, depending on the implementation, this results in a lot > of overhead due to either: > - calling Python functions (or any other Python instruction) for every \n > character in the input. The amount of executed Python instructions is O(n) > here. > - Copying string data into new strings. > > The fastest solution I've been using for some time, does this (simplified): > `accumulate(chain([0], map(len, text.splitlines(True))))`. The > performance is great here, because the amount of Python instructions is > O(1). Everything is chained in C-code thanks to itertools. Because of that, > it can outperform the regex solution with a factor of ~2.5. (Regex isn't > slow, but iterating over the results is.) > > The bad things about this solution is however: > - Very cumbersome syntax. > - We call `splitlines()` which internally allocates a huge amount of > strings, only to use their lengths. That is still much more overhead then a > simple for-loop in C would be. > > Performance matters here, because for these kind of problems, the list of > integers that gets produced is typically used as an index to quickly find > character offsets in the original string, depending on which line is > displayed/processed. The bisect library helps too to quickly convert any > index position of that string into a line number. The point is, that for > big inputs, the amount of Python instructions executed is not O(n), but > O(1). Of course, some of the C code remains O(n). > > So, my ask here. > Would it make sense to add a `line_offsets()` method to `str`? > Or even `character_offsets(character)` if we want to do that for any > character? > Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary > lengths? > > Thanks, > Jonathan > > > > > > > _______________________________________________ > Python-ideas mailing list -- python-ideas@python.org > To unsubscribe send an email to python-ideas-le...@python.org > https://mail.python.org/mailman3/lists/python-ideas.python.org/ > Message archived at > https://mail.python.org/archives/list/python-ideas@python.org/message/6WAMKYXOYA3SKL5HIRZP4WARMYYKXI3Q/ > Code of Conduct: http://python.org/psf/codeofconduct/ >
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/JZP5L57KORNOHILH7W3WZMCWRHGKTPQK/ Code of Conduct: http://python.org/psf/codeofconduct/