If you are working with bytes, then numpy could be perfect: not a small
dependency, of course, but it should work, and work fast.
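
As a rough illustration of the numpy idea, here is a minimal sketch. The
function name `line_offsets_numpy` is made up for this example, and it
assumes only LF, CR and CRLF count as line endings (which is what
bytes.splitlines() uses). It is meant to return the same list as the
accumulate/splitlines expression quoted below; it has not been benchmarked.

```python
import numpy as np


def line_offsets_numpy(data: bytes) -> list[int]:
    """Byte offsets of line boundaries, treating LF, CR and CRLF as endings."""
    if not data:
        return [0]
    # Zero-copy view of the bytes as unsigned 8-bit integers.
    buf = np.frombuffer(data, dtype=np.uint8)
    is_lf = buf == 0x0A
    is_cr = buf == 0x0D
    # A CR directly followed by LF belongs to a single CRLF ending,
    # so only the LF should close that line.
    cr_before_lf = np.zeros(buf.shape, dtype=bool)
    cr_before_lf[:-1] = is_cr[:-1] & is_lf[1:]
    # Each line ends one byte past its terminator.
    ends = np.flatnonzero(is_lf | (is_cr & ~cr_before_lf)) + 1
    offsets = [0, *ends.tolist()]
    # Like the accumulate/splitlines expression, close an unterminated last line.
    if offsets[-1] != len(data):
        offsets.append(len(data))
    return offsets
```

The boolean masks are temporary arrays the size of the input, so for
multi-gigabyte files this would still need to run chunk by chunk.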

And a Cython method would be quite easy to write, but of course
substantially harder to distribute :-(

-CHB

On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonat...@slenders.be>
wrote:

> Thanks all for all the responses! That's quite a bit to think about.
>
> A couple of thoughts:
>
> 1. First, I do support a transition to UTF-8, so I understand we don't
> want to add more methods that deal with character offsets. (I'm familiar
> with how strings work in Rust.) However, does that mean we won't be
> using/exposing any offset at all, or will it become possible to slice using
> byte offsets?
>
> 2. The commercial application I mentioned where this is critical is
> actually using bytes instead of str. Sorry for not mentioning earlier. We
> were doing the following:
>     list(accumulate(chain([0], map(len, text.splitlines(True)))))
> where text is a bytes object. This is significantly faster than a binary
> regex for finding all universal line endings. This application is an
> asyncio web app that streams Cisco show-tech files (often several
> gigabytes) from a file server over HTTP; stores them chunk by chunk into a
> local cache file on disk; and builds an index of byte offsets in the
> meantime by running the above expression over every chunk. That way the
> client web app can quickly load the lines from disk as the user scrolls
> through the file. A very niche application indeed, so use of Cython would
> be acceptable in this particular case. I published the relevant snippet
> here to be studied:
> https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
> It does handle an interesting edge case regarding UTF-16.
>
> 3. The code in prompt_toolkit can be found here:
> https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
> (It's not yet using 'accumulate' there, but otherwise it's the same.)
> Here too, universal line ending support is important, because the editing
> buffer can in theory contain a mix of line endings. It has to be
> performant, because it executes on every keystroke. A more complex data
> structure could probably solve the performance issues, but it's really not
> worth the complexity that it introduces in every text manipulation (like
> every key binding). Also, try using the "re" library to search over a list
> of lines or anything that's not a simple string.
>
> 4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5
> times faster than re. Imagine if splitlines() didn't have to do the work
> of actually creating the substrings, but only had to return the offsets:
> that should be much faster still and require far less memory. (I have a
> benchmark that does it one chunk at a time, to prevent using too much
> memory:
> https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf
> )
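
For readers following along, the two approaches compared in items 2 and 4
can be put side by side roughly like this. This is only a sketch, not the
code from the linked gists: the function names, the regex pattern and the
sample data are assumptions made for illustration.

```python
import re
from itertools import accumulate, chain

# Universal line endings for bytes; CRLF has to be tried before a lone CR.
LINE_END = re.compile(rb"\r\n|\r|\n")


def offsets_splitlines(text: bytes) -> list[int]:
    # The expression from item 2: a running total of line lengths,
    # with the line terminators kept so the lengths add up to byte offsets.
    return list(accumulate(chain([0], map(len, text.splitlines(True)))))


def offsets_regex(text: bytes) -> list[int]:
    # Same result via a binary regex: every match end is a line boundary.
    offsets = [0, *(m.end() for m in LINE_END.finditer(text))]
    if offsets[-1] != len(text):
        offsets.append(len(text))  # unterminated last line
    return offsets


# Made-up sample mixing all three kinds of line ending.
sample = b"line one\r\nline two\nline three\rline four" * 1000
assert offsets_splitlines(sample) == offsets_regex(sample)
```

To compare speed on a given interpreter, both functions can be passed to
timeit, for example `timeit.timeit(lambda: offsets_regex(sample), number=100)`.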
>
> So, talking about bytes: would it be acceptable to have a
> `bytes.line_offsets()` method instead? Or
> `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay,
> aren't they? `str.splitlines(return_offsets=True)` would be very nice, but
> I understand the concerns.
>
> It's somewhat frustrating knowing that for `splitlines()`, the
> information is there, already computed, just not immediately accessible
> (without having Python do lots of unnecessary work).
>
> Jonathan
>
>
> On Sun, Jun 19, 2022 at 15:34, Jonathan Fine <jfine2...@gmail.com>
> wrote:
>
>> Hi
>>
>> This is a nice problem, well presented. Here are four comments / questions.
>>
>> 1. How does the introduction of faster CPython in Python 3.11 affect the
>> benchmarks?
>> 2. Is there an across-the-board change that would speed up this
>> line-offsets task?
>> 3. To limit splitlines memory use (at small performance cost), chunk the
>> input string into, say, 4 KB blocks.
>> 4. Perhaps anything done here for strings should also be done for bytes.
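
One thing to watch with the chunking in point 3: a "\r\n" pair can straddle
a chunk boundary, so a trailing "\r" cannot be counted as a line ending
until the next chunk arrives. A minimal sketch of one way to handle that
for bytes follows. The name `iter_line_offsets` is hypothetical, and this is
neither the code from the gist mentioned elsewhere in this thread nor an
attempt at the UTF-16 case.

```python
from typing import Iterable, Iterator


def iter_line_offsets(chunks: Iterable[bytes]) -> Iterator[int]:
    """Yield 0, then the absolute offset just past every line terminator."""
    base = 0            # absolute offset of the start of the current chunk
    pending_cr = False  # previous chunk ended in CR; boundary not yet decided
    yield 0
    for chunk in chunks:
        if not chunk:
            continue
        start = 0
        if pending_cr:
            # A leading LF completes a CRLF that straddles the chunk boundary.
            if chunk[:1] == b"\n":
                start = 1
            yield base + start
            pending_cr = False
        pos = start
        for line in chunk[start:].splitlines(True):
            pos += len(line)
            if line.endswith(b"\n"):
                yield base + pos
            elif line.endswith(b"\r"):
                if pos == len(chunk):
                    pending_cr = True  # wait for the next chunk
                else:
                    yield base + pos
            # Otherwise the chunk ends mid-line: no boundary to report yet.
        base += len(chunk)
    if pending_cr:
        yield base  # the stream ended with a lone CR
```

Usage is along the lines of `list(iter_line_offsets(iter(lambda: f.read(4096), b"")))`
for a file object `f` opened in binary mode.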
>>
>> --
>> Jonathan
>> _______________________________________________
>> Python-ideas mailing list -- python-ideas@python.org
>> To unsubscribe send an email to python-ideas-le...@python.org
>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/AETGT5HDF3QOFODOWKB4X45ZE4CZ7Y3M/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
-- 
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/LFAJR6U737OA2UB6SKOPJZCOTPZLGV2A/
Code of Conduct: http://python.org/psf/codeofconduct/
