On Sat, Jun 18, 2022 at 5:13 AM Jonathan Slenders <jonat...@slenders.be> wrote:
> First time, it was needed in prompt_toolkit, where I spent a crazy amount of 
> time looking for the most performant solution.
> Third time is for the Rich/Textual project from Will McGugan. (See: 
> https://twitter.com/willmcgugan/status/1537782771137011715 )

Would you give me a pointer to the code?
I want to know its use cases.

>
> The fastest solution I've been using for some time, does this (simplified): 
> `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance is 
> great here, because the amount of Python instructions is O(1). Everything is 
> chained in C-code thanks to itertools. Because of that, it can outperform the 
> regex solution with a factor of ~2.5. (Regex isn't slow, but iterating over 
> the results is.)
>
> The bad things about this solution are, however:
> - Very cumbersome syntax.
> - We call `splitlines()` which internally allocates a huge amount of strings, 
> only to use their lengths. That is still much more overhead than a simple 
> for-loop in C would be.
>
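
For readers following along, the quoted one-liner can be sketched as a
small helper (a sketch; the list() call and docstring comments are mine,
and the final offset equals len(text)):

```python
from itertools import accumulate, chain

def line_offsets(text):
    # splitlines(True) keeps the line endings, so the cumulative sum of
    # line lengths gives the start offset of each line; the leading 0
    # is the offset of the first line.
    return list(accumulate(chain([0], map(len, text.splitlines(True)))))

offsets = line_offsets("ab\ncde\nf")
# offsets == [0, 3, 7, 8]; the trailing entry is len(text)
```

All the per-line work happens inside map() and itertools in C, so the
number of Python-level instructions does not grow with the input.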

FWIW, I had proposed str.iterlines() to fix the incompatibility between
IO.readlines() and str.splitlines().
It would be much more efficient than splitlines() because it doesn't
allocate a huge number of strings at once; it allocates one line
string at a time.
https://discuss.python.org/t/changing-str-splitlines-to-match-file-readlines/174/2

Of course, it would still be slower than your line_offsets() idea
because it still needs to allocate many line strings.
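A rough pure-Python sketch of that idea (hypothetical; for brevity it
only recognizes "\n", while the real splitlines() handles many more
line boundaries):

```python
def iterlines(text):
    # Hypothetical generator version of str.splitlines(keepends=True):
    # yields one line at a time instead of building the whole list.
    start = 0
    end = len(text)
    while start < end:
        i = text.find("\n", start)
        if i == -1:
            yield text[start:]   # last line has no trailing newline
            return
        yield text[start:i + 1]  # include the "\n"
        start = i + 1
```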

> Performance matters here, because for these kind of problems, the list of 
> integers that gets produced is typically used as an index to quickly find 
> character offsets in the original string, depending on which line is 
> displayed/processed. The bisect library helps too to quickly convert any 
> index position of that string into a line number. The point is, that for big 
> inputs, the amount of Python instructions executed is not O(n), but O(1). Of 
> course, some of the C code remains O(n).
>
> So, my ask here.
> Would it make sense to add a `line_offsets()` method to `str`?
> Or even `character_offsets(character)` if we want to do that for any 
> character?
> Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary 
> lengths?
>

I don't like string offsets, so I don't like adding more methods that
return offsets.
Currently, string offsets in Python are counted in code points. This
is not efficient for Python implementations that use UTF-8 (PyPy and
MicroPython) or UTF-16 (maybe Jython and IronPython, but I don't know)
internally.

CPython uses PEP 393 for now, so offsets are efficient.
But I want to change it to UTF-8 in the future. A UTF-8 internal
encoding is much more efficient for many Python use cases, such as:

* Reading a UTF-8 string from text and writing it to a UTF-8 console.
* Reading UTF-8 from a database and writing it to UTF-8 JSON.

Additionally, there are many very fast string algorithms written in C
or Rust that work with UTF-8.
Python is a glue language; reducing the overhead of working with such
libraries is good for Python.

For now, my recommendation is to use a library written in Cython if
this is performance-critical.

Regards,
-- 
Inada Naoki  <songofaca...@gmail.com>
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/57JHIUS5RAOH4IV6NSOYGNVTPAEQTCMC/