Thanks all for all the responses! That's quite a bit to think about. A couple of thoughts:
1. First, I do support a transition to UTF-8, so I understand we don't want to add more methods that deal with character offsets. (I'm familiar with how strings work in Rust.) However, does that mean we won't be using or exposing any offsets at all, or will it become possible to slice using byte offsets?

2. The commercial application I mentioned where this is critical is actually using bytes instead of str. Sorry for not mentioning that earlier. We were doing the following:

    list(accumulate(chain([0], map(len, text.splitlines(True)))))

where text is a bytes object. This is significantly faster than a binary regex for finding all universal line endings. The application is an asyncio web app that streams Cisco show-tech files (often several gigabytes) from a file server over HTTP, stores them chunk by chunk in a local cache file on disk, and builds an index of byte offsets in the meantime by running the above expression over every chunk. That way the client web app can quickly load the lines from disk as the user scrolls through the file. A very niche application indeed, so using Cython would be acceptable in this particular case. I published the relevant snippet here for anyone who wants to study it: https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868 It does handle an interesting edge case regarding UTF-16.

3. The code in prompt_toolkit can be found here: https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209 (It's not yet using 'accumulate' there, but otherwise it's the same.) Here too, support for universal line endings is important, because the editing buffer can in theory contain a mix of line endings. It has to be performant, because it executes on every keystroke. A more complex data structure could probably solve the performance issues in this case, but it's really not worth the complexity it would introduce in every text manipulation (like every key binding).
Also, try using the "re" library to search over a list of lines, or anything else that's not a simple string.

4. I tested on 3.11.0b3. The splitlines() approach is still 2.5 times faster than re. Imagine if splitlines() didn't have to do the work of actually creating the substrings and only had to return the offsets: that should be much faster still and need far less memory. (I have a benchmark that does it one chunk at a time, to avoid using too much memory: https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf )

So, talking about bytes: would it be acceptable to have a `bytes.line_offsets()` method instead? Or `bytes.splitlines(return_offsets=True)`? Byte offsets are okay, aren't they?

`str.splitlines(return_offsets=True)` would be very nice, but I understand the concerns.

It's somewhat frustrating to know that for `splitlines()` the information is there, already computed, just not immediately accessible (without having Python do lots of unnecessary work).

Jonathan

On Sun, 19 Jun 2022 at 15:34, Jonathan Fine <jfine2...@gmail.com> wrote:

> Hi
>
> This is a nice problem, well presented. Here's four comments / questions.
>
> 1. How does the introduction of faster CPython in Python 3.11 affect the
> benchmarks?
> 2. Is there an across-the-board change that would speedup this
> line-offsets task?
> 3. To limit splitlines memory use (at small performance cost), chunk the
> input string into say 4 kb blocks.
> 4. Perhaps anything done here for strings should also be done for bytes.
>
> --
> Jonathan
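For reference, here is a minimal sketch of the splitlines()-based offset computation discussed in this message, plus one possible way to run it chunk by chunk. The function names `line_start_offsets` and `index_chunks` are made up for illustration; this is not the code from the gists or from prompt_toolkit, and it assumes an ASCII-compatible encoding (it does not attempt the UTF-16 edge case). The chunked variant has to cope with a "\r\n" pair that straddles a chunk boundary, which is the only subtle part.

```python
from collections.abc import Iterable
from itertools import accumulate, chain


def line_start_offsets(data: bytes) -> list[int]:
    """Offsets where each line of `data` starts, followed by len(data).

    Same expression as in the message; bytes.splitlines() treats
    b"\n", b"\r" and b"\r\n" as line boundaries.
    """
    return list(accumulate(chain([0], map(len, data.splitlines(True)))))


def index_chunks(chunks: Iterable[bytes]) -> list[int]:
    """Hypothetical chunk-by-chunk variant returning the same offsets."""
    starts = [0]
    base = 0                      # offset of the current chunk in the stream
    prev_ended_with_cr = False
    for chunk in chunks:
        if not chunk:
            continue
        if prev_ended_with_cr and chunk.startswith(b"\n"):
            # b"\r\n" straddles the boundary: the start recorded right
            # after the b"\r" was premature, so drop it again.
            starts.pop()
        end = 0
        for piece in chunk.splitlines(True):
            end += len(piece)
            if piece.endswith((b"\n", b"\r")):
                starts.append(base + end)
        prev_ended_with_cr = chunk.endswith(b"\r")
        base += len(chunk)
    if starts[-1] != base:
        starts.append(base)       # final entry is the total length
    return starts
```

With such an index, line i of the cached file occupies bytes offsets[i] up to offsets[i + 1], so it can be loaded with a seek and a single read. A built-in that returned these offsets directly would skip creating the intermediate substrings that splitlines() builds here.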
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/FZ7V4FFKR45YLQDHTD2JZYEWZ5HEI3P2/
Code of Conduct: http://python.org/psf/codeofconduct/