If you are working with bytes, then numpy could be perfect. It's not a small dependency, of course, but it should work, and work fast.
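A rough sketch of what the numpy route might look like, untested for speed. The function name `line_offsets` is made up for illustration, and the sketch assumes the goal is to reproduce the offsets that the `accumulate`/`splitlines(True)` expression quoted below gives for a bytes object, i.e. 0 followed by the end offset of every line, treating `\n`, `\r` and `\r\n` as line endings:

```python
import numpy as np


def line_offsets(data: bytes) -> list[int]:
    """Return 0 followed by the end offset of every line in `data`."""
    if not data:
        return [0]
    arr = np.frombuffer(data, dtype=np.uint8)  # zero-copy, read-only view
    is_lf = arr == 0x0A
    is_cr = arr == 0x0D
    # A CR directly followed by LF belongs to a single "\r\n" ending,
    # so only the LF should count as the end of that line.
    cr_before_lf = np.zeros(arr.size, dtype=bool)
    cr_before_lf[:-1] = is_cr[:-1] & is_lf[1:]
    ends = np.flatnonzero(is_lf | (is_cr & ~cr_before_lf)) + 1
    offsets = [0] + ends.tolist()
    if offsets[-1] != len(data):
        offsets.append(len(data))  # last line has no terminator
    return offsets
```

When run chunk by chunk, a chunk that ends in `\r` would need extra care, since the matching `\n` may be the first byte of the next chunk; this sketch does not handle that.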
And a Cython method would be quite easy to write, but of course substantially harder to distribute :-(

-CHB

On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonat...@slenders.be> wrote:

> Thanks all for all the responses! That's quite a bit to think about.
>
> A couple of thoughts:
>
> 1. First, I do support a transition to UTF-8, so I understand we don't
> want to add more methods that deal with character offsets. (I'm familiar
> with how strings work in Rust.) However, does that mean we won't be
> using/exposing any offset at all, or will it become possible to slice using
> byte offsets?
>
> 2. The commercial application I mentioned where this is critical is
> actually using bytes instead of str. Sorry for not mentioning it earlier. We
> were doing the following:
> list(accumulate(chain([0], map(len, text.splitlines(True)))))
> where text is a bytes object. This is significantly faster than a binary
> regex for finding all universal line endings. This application is an
> asyncio web app that streams Cisco show-tech files (often several
> gigabytes) from a file server over HTTP; stores them chunk by chunk into a
> local cache file on disk; and builds an index of byte offsets in the
> meantime by running the above expression over every chunk. That way the
> client web app can quickly load the lines from disk as the user scrolls
> through the file. A very niche application indeed, so use of Cython would
> be acceptable in this particular case. I published the relevant snippet
> here to be studied:
> https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
> It does handle an interesting edge case regarding UTF-16.
>
> 3. The code in prompt_toolkit can be found here:
> https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
> (It's not yet using 'accumulate' there, but for the rest it's the same.)
> Also here, universal line endings support is important, because the editing
> buffer can in theory contain a mix of line endings. It has to be
> performant, because it executes on every key stroke. In this case, a more
> complex data structure could probably solve performance issues here, but
> it's really not worth the complexity that it introduces in every text
> manipulation (like every key binding). Also try using the "re" library to
> search over a list of lines or anything that's not a simple string.
>
> 4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5
> times faster than re. Imagine if splitlines() didn't have to do the work
> of actually creating the substrings, but only had to return the offsets;
> that should be much faster still and not require so much memory. (I have a
> benchmark that does it one chunk at a time, to prevent using too much
> memory:
> https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf
> )
>
> So talking about bytes. Would it be acceptable to have a
> `bytes.line_offsets()` method instead? Or
> `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or
> not? `str.splitlines(return_offsets=True)` would be very nice, but I
> understand the concerns.
>
> It's somewhat frustrating here knowing that for `splitlines()`, the
> information is there, already computed, just not immediately accessible
> (without having Python do lots of unnecessary work).
>
> Jonathan
>
>
> On Sun, Jun 19, 2022 at 15:34, Jonathan Fine <jfine2...@gmail.com>
> wrote:
>
>> Hi
>>
>> This is a nice problem, well presented. Here are four comments / questions.
>>
>> 1. How does the introduction of faster CPython in Python 3.11 affect the
>> benchmarks?
>> 2. Is there an across-the-board change that would speed up this
>> line-offsets task?
>> 3. To limit splitlines memory use (at small performance cost), chunk the
>> input string into say 4 KB blocks.
>> 4. Perhaps anything done here for strings should also be done for bytes.
>>
>> --
>> Jonathan
>> _______________________________________________
>> Python-ideas mailing list -- python-ideas@python.org
>> To unsubscribe send an email to python-ideas-le...@python.org
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/AETGT5HDF3QOFODOWKB4X45ZE4CZ7Y3M/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
> _______________________________________________
> Python-ideas mailing list -- python-ideas@python.org
> To unsubscribe send an email to python-ideas-le...@python.org
> https://mail.python.org/mailman3/lists/python-ideas.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-ideas@python.org/message/FZ7V4FFKR45YLQDHTD2JZYEWZ5HEI3P2/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/LFAJR6U737OA2UB6SKOPJZCOTPZLGV2A/
Code of Conduct: http://python.org/psf/codeofconduct/