Hmm - I’m a bit confused about how you handle mixed / multiple line endings. If you use splitlines(), it removes the line endings, so with two-character line endings you’ll get off-by-one errors, yes?
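For reference, here is a minimal sketch of how the keepends argument changes the picture, built on the accumulate expression quoted further down in this thread. The sample data and the helper name `line_start_offsets` are made up for illustration; this is not code from either application mentioned in the thread.

```python
from itertools import accumulate, chain


def line_start_offsets(data):
    """Hypothetical helper: cumulative byte offsets of line boundaries.

    Uses splitlines(True) so every piece keeps its terminator and the
    piece lengths add up to real positions in `data`.
    """
    return list(accumulate(chain([0], map(len, data.splitlines(True)))))


# Mixed endings: "\r\n", "\n" and a lone "\r".
text = b"one\r\ntwo\nthree\rfour"

kept = text.splitlines(True)   # terminators preserved
stripped = text.splitlines()   # terminators removed

offsets = line_start_offsets(text)

# Summing the stripped lengths instead drifts by the width of each
# terminator that was dropped (2 for "\r\n", 1 for "\n" or "\r").
drifting = list(accumulate(chain([0], map(len, stripped))))
```

Note that the last entry of `offsets` is the total length of the data rather than the start of a line, so consecutive pairs can be used directly as slice bounds.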
I would think you could look for “\n” and get the correct answer (with extraneous “\r”s in the substrings…)

-CHB

On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <python...@gmail.com> wrote:

> If you are working with bytes, then numpy could be perfect: not a small
> dependency of course, but it should work, and work fast.
>
> And a cython method would be quite easy to write, but of course
> substantially harder to distribute :-(
>
> -CHB
>
> On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonat...@slenders.be>
> wrote:
>
>> Thanks all for all the responses! That's quite a bit to think about.
>>
>> A couple of thoughts:
>>
>> 1. First, I do support a transition to UTF-8, so I understand we don't
>> want to add more methods that deal with character offsets. (I'm familiar
>> with how strings work in Rust.) However, does that mean we won't be
>> using/exposing any offset at all, or will it become possible to slice using
>> byte offsets?
>>
>> 2. The commercial application I mentioned where this is critical is
>> actually using bytes instead of str. Sorry for not mentioning that earlier.
>> We were doing the following:
>> list(accumulate(chain([0], map(len, text.splitlines(True)))))
>> where text is a bytes object. This is significantly faster than a binary
>> regex for finding all universal line endings. This application is an
>> asyncio web app that streams Cisco show-tech files (often several
>> gigabytes) from a file server over HTTP; stores them chunk by chunk into a
>> local cache file on disk; and builds an index of byte offsets in the
>> meantime by running the above expression over every chunk. That way the
>> client web app can quickly load the lines from disk as the user scrolls
>> through the file. A very niche application indeed, so use of Cython would
>> be acceptable in this particular case.
>> I published the relevant snippet here to be studied:
>> https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
>> It does handle an interesting edge case regarding UTF-16.
>>
>> 3. The code in prompt_toolkit can be found here:
>> https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
>> (It's not yet using 'accumulate' there, but for the rest it's the same.)
>> Also here, universal line endings support is important, because the editing
>> buffer can in theory contain a mix of line endings. It has to be
>> performant, because it executes on every key stroke. In this case, a more
>> complex data structure could probably solve performance issues here, but
>> it's really not worth the complexity that it introduces in every text
>> manipulation (like every key binding). Also try using the "re" library to
>> search over a list of lines or anything that's not a simple string.
>>
>> 4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5
>> times faster than re. Imagine if splitlines() didn't have to do the work
>> of actually creating the substrings, but only had to return the offsets;
>> that should be much faster still and not require so much memory. (I have a
>> benchmark that does it one chunk at a time, to prevent using too much
>> memory:
>> https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf
>> )
>>
>> So, talking about bytes: would it be acceptable to have a
>> `bytes.line_offsets()` method instead? Or
>> `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or
>> not? `str.splitlines(return_offsets=True)` would be very nice, but I
>> understand the concerns.
>>
>> It's somewhat frustrating knowing that for `splitlines()`, the
>> information is there, already computed, just not immediately accessible
>> (without having Python do lots of unnecessary work).
>>
>> Jonathan
>>
>>
>> On Sun, Jun 19, 2022 at 15:34, Jonathan Fine <jfine2...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> This is a nice problem, well presented. Here are four comments / questions.
>>>
>>> 1. How does the introduction of faster CPython in Python 3.11 affect the
>>> benchmarks?
>>> 2. Is there an across-the-board change that would speed up this
>>> line-offsets task?
>>> 3. To limit splitlines memory use (at small performance cost), chunk the
>>> input string into, say, 4 KB blocks.
>>> 4. Perhaps anything done here for strings should also be done for bytes.
>>>
>>> --
>>> Jonathan
>>> _______________________________________________
>>> Python-ideas mailing list -- python-ideas@python.org
>>> To unsubscribe send an email to python-ideas-le...@python.org
>>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>>> Message archived at
>>> https://mail.python.org/archives/list/python-ideas@python.org/message/AETGT5HDF3QOFODOWKB4X45ZE4CZ7Y3M/
>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>
>> _______________________________________________
>> Python-ideas mailing list -- python-ideas@python.org
>> To unsubscribe send an email to python-ideas-le...@python.org
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/FZ7V4FFKR45YLQDHTD2JZYEWZ5HEI3P2/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
> --
> Christopher Barker, PhD (Chris)
>
> Python Language Consulting
>   - Teaching
>   - Scientific Software Development
>   - Desktop GUI and Web Development
>   - wxPython, numpy, scipy, Cython
>

--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
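The “\n” scan suggested at the top of this message could look roughly like the sketch below. It is an illustration only: the function name `newline_offsets` is hypothetical, and it assumes that only “\n” (and therefore also “\r\n”) terminates a line, so a lone “\r” is not treated as a line break the way splitlines() treats it.

```python
def newline_offsets(data):
    """Hypothetical helper: start offset of each line in a bytes object.

    Only b"\n" counts as a terminator, so "\r\n" works (the "\r" stays at
    the end of the preceding line) but a lone "\r" does not split.
    """
    offsets = [0]
    pos = data.find(b"\n")
    while pos != -1:
        # The next line starts right after the newline byte.
        offsets.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return offsets


sample = b"one\r\ntwo\nthree\rfour"
starts = newline_offsets(sample)

# Slice each line out again; the final line runs to the end of the data.
ends = starts[1:] + [len(sample)]
lines = [sample[s:e] for s, e in zip(starts, ends)]
```

If the data ends with “\n”, the last offset equals len(data), i.e. it marks an empty trailing line; whether to keep or drop that entry depends on what the index is used for.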
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/MZHVUDCRNQ7AI6MPFCQAVIJLCRVUKMJ4/
Code of Conduct: http://python.org/psf/codeofconduct/