Good catch! One correction here, I somewhat mixed up the benchmarks. I forgot both projects of mine required support for universal line endings exactly like splitlines() does this out of the box. That requires a more complex regex pattern. I was actually using: re.compile(r"\n|\r(?!\n)") And then the regex becomes significantly slower than the splitlines() solution, which is still much slower than it has to be.
This makes me realize that `str.indexes(char)` is actually not what I need, but really a `str.line_offsets()` which returns exactly the positions that `str.splitlines()` would use. Does that make sense? If this is reasonable, I wouldn't mind working on the implementation. (@Christophe: In Python, a single string as a data structure is often much easier to deal with and overall extremely performant. Try searching over a list of lines.) Thanks, Jonathan Le sam. 18 juin 2022 à 21:09, Lucas Wiman <lucas.wi...@gmail.com> a écrit : > I'm a little confused by the benchmark. Using re looks pretty competitive > in terms of speed, and should be much more memory efficient. > > # https://www.gutenberg.org/cache/epub/100/pg100.txt (5.7mb; ~170K lines) > with open('/tmp/shakespeare.txt', 'r') as f: > text = f.read() > import re > from itertools import * > line_re = re.compile(r"\n") > > Then when I run it: > In [25]: %timeit _ = list(accumulate(chain([0], map(len, > text.splitlines(True))))) > 30.4 ms ± 705 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) > > In [26]: %timeit _ = [m.start() for m in line_re.finditer(text)] > 29 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) > > This is on 3.10.3 on an Intel 2.3gz i9 Macbook. (Note that the regex is > off-by-one from the splitlines implementation.) > > What benchmark shows the regex to be significantly slower? > > That said, str.indexes(char) sounds like a reasonable addition. > > Best wishes, > Lucas Wiman > > On Fri, Jun 17, 2022 at 1:12 PM Jonathan Slenders <jonat...@slenders.be> > wrote: > >> Hi everyone, >> >> Today was the 3rd time I came across a situation where it was needed to >> retrieve all the positions of the line endings (or beginnings) in a very >> long python string as efficiently as possible. First time, it was needed in >> prompt_toolkit, where I spent a crazy amount of time looking for the most >> performant solution. Second time was in a commercial project where >> performance was very critical too. Third time is for the Rich/Textual >> project from Will McGugan. (See: >> https://twitter.com/willmcgugan/status/1537782771137011715 ) >> >> The problem is that the `str` type doesn't expose any API to efficiently >> find all \n positions. Every Python implementation is either calling >> `.index()` in a loop and collecting the results or running a regex over the >> string and collecting all positions. >> >> For long strings, depending on the implementation, this results in a lot >> of overhead due to either: >> - calling Python functions (or any other Python instruction) for every \n >> character in the input. The amount of executed Python instructions is O(n) >> here. >> - Copying string data into new strings. >> >> The fastest solution I've been using for some time, does this >> (simplified): `accumulate(chain([0], map(len, text.splitlines(True))))`. >> The performance is great here, because the amount of Python instructions is >> O(1). Everything is chained in C-code thanks to itertools. Because of that, >> it can outperform the regex solution with a factor of ~2.5. (Regex isn't >> slow, but iterating over the results is.) >> >> The bad things about this solution is however: >> - Very cumbersome syntax. >> - We call `splitlines()` which internally allocates a huge amount of >> strings, only to use their lengths. That is still much more overhead then a >> simple for-loop in C would be. >> >> Performance matters here, because for these kind of problems, the list of >> integers that gets produced is typically used as an index to quickly find >> character offsets in the original string, depending on which line is >> displayed/processed. The bisect library helps too to quickly convert any >> index position of that string into a line number. The point is, that for >> big inputs, the amount of Python instructions executed is not O(n), but >> O(1). Of course, some of the C code remains O(n). >> >> So, my ask here. >> Would it make sense to add a `line_offsets()` method to `str`? >> Or even `character_offsets(character)` if we want to do that for any >> character? >> Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary >> lengths? >> >> Thanks, >> Jonathan >> >> >> >> >> >> >> _______________________________________________ >> Python-ideas mailing list -- python-ideas@python.org >> To unsubscribe send an email to python-ideas-le...@python.org >> https://mail.python.org/mailman3/lists/python-ideas.python.org/ >> Message archived at >> https://mail.python.org/archives/list/python-ideas@python.org/message/6WAMKYXOYA3SKL5HIRZP4WARMYYKXI3Q/ >> Code of Conduct: http://python.org/psf/codeofconduct/ >> >
_______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/2V65TPPA237YMZNLL2TWKO34XVZPXYJR/ Code of Conduct: http://python.org/psf/codeofconduct/