[Python-ideas] Re: Add a line_offsets() method to str

Steve Jorgensen Mon, 20 Jun 2022 11:00:15 -0700

Steve Jorgensen wrote:
> Jonathan Slenders wrote:
> > Hi everyone,
> > Today was the 3rd time I came across a situation where it was needed to
> > retrieve all the positions of the line endings (or beginnings) in a very
> > long python string as efficiently as possible. First time, it was needed in
> > prompt_toolkit, where I spent a crazy amount of time looking for the most
> > performant solution. Second time was in a commercial project where
> > performance was very critical too. Third time is for the Rich/Textual
> > project from Will McGugan. (See:
> > https://twitter.com/willmcgugan/status/1537782771137011715 )
> > The problem is that the `str` type doesn't expose any API to efficiently
> > find all \n positions. Every Python implementation is either calling
> > `.index()` in a loop and collecting the results or running a regex over the
> > string and collecting all positions.
> > For long strings, depending on the implementation, this results in a lot of
> > overhead due to either:
> > 
> > calling Python functions (or any other Python instruction) for every \n
> > 
> > character in the input. The amount of executed Python instructions is O(n)
> > here.
> > 
> > Copying string data into new strings.
> > 
> > The fastest solution I've been using for some time, does this (simplified):
> > `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance
> > is great here, because the amount of Python instructions is O(1).
> > Everything is chained in C-code thanks to itertools. Because of that, it
> > can outperform the regex solution with a factor of ~2.5. (Regex isn't slow,
> > but iterating over the results is.)
> > The bad things about this solution is however:
> > 
> > Very cumbersome syntax.
> > We call `splitlines()` which internally allocates a huge amount of
> > 
> > strings, only to use their lengths. That is still much more overhead then a
> > simple for-loop in C would be.
> > Performance matters here, because for these kind of problems, the list of
> > integers that gets produced is typically used as an index to quickly find
> > character offsets in the original string, depending on which line is
> > displayed/processed. The bisect library helps too to quickly convert any
> > index position of that string into a line number. The point is, that for
> > big inputs, the amount of Python instructions executed is not O(n), but
> > O(1). Of course, some of the C code remains O(n).
> > So, my ask here.
> > Would it make sense to add a `line_offsets()` method to `str`?
> > Or even `character_offsets(character)` if we want to do that for any
> > character?
> > Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary
> > lengths?
> > Thanks,
> > Jonathan
> > I presume there is some reason that `re.findall` did not work or was not 
> > optimal?


I just saw your reply elsewhere in the conversation that says

> That requires a more complex regex pattern. I was actually using:
> re.compile(r"\n|\r(?!\n)")
> And then the regex becomes significantly slower than the splitlines() 
> solution, which is still much slower than it has to be.
_______________________________________________
Python-ideas mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/[email protected]/message/JGY2YNOCKZ2KS7BMQMNCEY3YHIRJC3UL/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Add a line_offsets() method to str

Reply via email to