Thanks, everyone, for all the responses! That's quite a bit to think about.

A couple of thoughts:

1. First, I do support a transition to UTF-8, so I understand we don't want
to add more methods that deal with character offsets. (I'm familiar with
how strings work in Rust.) However, does that mean we won't be
using/exposing any offset at all, or will it become possible to slice using
byte offsets?

2. The commercial application I mentioned where this is critical is
actually using bytes instead of str. Sorry for not mentioning that earlier. We
were doing the following:
    from itertools import accumulate, chain
    list(accumulate(chain([0], map(len, text.splitlines(keepends=True)))))
where text is a bytes object. This is significantly faster than a binary
regex for finding all universal line endings. This application is an
asyncio web app that streams Cisco show-tech files (often several
gigabytes) from a file server over HTTP; stores them chunk by chunk into a
local cache file on disk; and builds an index of byte offsets in the
meantime by running the above expression over every chunk. That way the
client web app can quickly load the lines from disk as the user scrolls
through the file. A very niche application indeed, so use of Cython would
be acceptable in this particular case. I published the relevant snippet
here to be studied:
https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
It does handle an interesting edge case regarding UTF-16.

3. The code in prompt_toolkit can be found here:
https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
(It's not yet using 'accumulate' there, but for the rest it's the same.)
Also here, universal line endings support is important, because the editing
buffer can in theory contain a mix of line endings. It has to be
performant, because it executes on every key stroke. In this case, a more
complex data structure could probably solve performance issues here, but
it's really not worth the complexity that it introduces in every text
manipulation (like every key binding). Also, just try using the "re" library
to search over a list of lines or anything else that's not a simple string.

4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5 times
faster than re. Imagine if splitlines() didn't have to do the work of
actually creating the substrings, but only had to return the offsets: that
should be faster still and require much less memory. (I have a
benchmark that does it one chunk at a time, to avoid using too much
memory:
https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )
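
A rough way to reproduce this kind of comparison is sketched below. It is not
the benchmark from the gist above: the sample data, the regex pattern and the
function names are assumptions made for illustration, and timings will vary
with the machine and the Python version.

```python
import re
import timeit
from itertools import accumulate, chain

# Assumed pattern for the line endings that bytes.splitlines() recognises.
NEWLINE = re.compile(rb"\r\n|\r|\n")

def offsets_splitlines(data: bytes) -> list[int]:
    # The expression from point 2: cumulative lengths of the split lines.
    return list(accumulate(chain([0], map(len, data.splitlines(keepends=True)))))

def offsets_regex(data: bytes) -> list[int]:
    # Offset just past every line-ending match. This equals the list above
    # when the data ends with a line ending (or is empty).
    return [0] + [m.end() for m in NEWLINE.finditer(data)]

# Made-up sample with mixed line endings, ending with a newline.
sample = b"first line\nsecond\r\nthird\rfourth\n" * 10_000

t_split = timeit.timeit(lambda: offsets_splitlines(sample), number=5)
t_regex = timeit.timeit(lambda: offsets_regex(sample), number=5)
print(f"splitlines: {t_split:.4f}s  regex: {t_regex:.4f}s")
```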

So, talking about bytes: would it be acceptable to have a
`bytes.line_offsets()` method instead? Or
`bytes.splitlines(return_offsets=True)`? Byte offsets are okay, aren't
they? `str.splitlines(return_offsets=True)` would be very nice, but I
understand the concerns.
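
To make the proposal concrete, here is a pure-Python sketch of what such a
method might return. Neither `bytes.line_offsets()` nor a `return_offsets`
parameter exists today; the function name and the exact return value (which
mirrors the accumulate expression from point 2, including the final
end-of-data offset) are assumptions.

```python
def line_offsets(data: bytes) -> list[int]:
    """Hypothetical stand-in for the proposed bytes.line_offsets()."""
    offsets = [0]
    i, n = 0, len(data)
    while i < n:
        if data[i] == 0x0A:  # b"\n"
            i += 1
            offsets.append(i)
        elif data[i] == 0x0D:  # b"\r", or the first half of b"\r\n"
            i += 2 if data[i + 1 : i + 2] == b"\n" else 1
            offsets.append(i)
        else:
            i += 1
    # Mirror the splitlines-based expression: end with len(data) even when
    # the data does not finish with a line ending.
    if offsets[-1] != n:
        offsets.append(n)
    return offsets

print(line_offsets(b"one\ntwo\r\nthree"))  # [0, 4, 9, 14]
```

A byte-by-byte loop like this is slow in pure Python; it only illustrates the
semantics, with no substrings created along the way.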

It's somewhat frustrating to know that, for `splitlines()`, the
information is there, already computed, just not accessible without
having Python do lots of unnecessary work.
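
For the streaming case in point 2, the per-chunk bookkeeping might look like
the sketch below. This is not the code from the gist: `index_chunks` and
`read_line` are hypothetical names, and the sketch assumes byte-oriented line
endings (`\n`, `\r`, `\r\n`) only, so it does not cover the UTF-16 edge case
mentioned there. It still goes through `splitlines()`, so the substrings are
created and then thrown away.

```python
import io

def index_chunks(chunks) -> list[int]:
    """Return the byte offset at which every line starts, given chunked data."""
    offsets = [0]
    base = 0               # absolute offset of the current chunk
    ended_with_cr = False  # did the previous non-empty chunk end with b"\r"?
    for chunk in chunks:
        if not chunk:
            continue
        # A b"\r\n" split across two chunks: the b"\r" already recorded a
        # line start at `base`; drop it, the b"\n" below records base + 1.
        if ended_with_cr and chunk.startswith(b"\n"):
            offsets.pop()
        pos = base
        for line in chunk.splitlines(keepends=True):
            pos += len(line)
            if line.endswith((b"\n", b"\r")):
                offsets.append(pos)
        base += len(chunk)
        ended_with_cr = chunk.endswith(b"\r")
    return offsets

def read_line(f, offsets, lineno):
    """Read line `lineno` (0-based) from a seekable binary file."""
    f.seek(offsets[lineno])
    if lineno + 1 < len(offsets):
        return f.read(offsets[lineno + 1] - offsets[lineno])
    return f.read()  # last line: read to the end of the file

chunks = [b"alpha\r", b"\nbeta\nga", b"mma"]
offsets = index_chunks(chunks)
print(offsets)  # [0, 7, 12]
print(read_line(io.BytesIO(b"".join(chunks)), offsets, 1))  # b'beta\n'
```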

Jonathan


On Sun, 19 Jun 2022 at 15:34, Jonathan Fine <jfine2...@gmail.com> wrote:

> Hi
>
> This is a nice problem, well presented. Here are four comments / questions.
>
> 1. How does the introduction of faster CPython in Python 3.11 affect the
> benchmarks?
> 2. Is there an across-the-board change that would speed up this
> line-offsets task?
> 3. To limit splitlines memory use (at small performance cost), chunk the
> input string into, say, 4 KB blocks.
> 4. Perhaps anything done here for strings should also be done for bytes.
>
> --
> Jonathan
> _______________________________________________
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/AETGT5HDF3QOFODOWKB4X45ZE4CZ7Y3M/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/FZ7V4FFKR45YLQDHTD2JZYEWZ5HEI3P2/
