Hmm - I’m a bit confused about how you handle mixed / multiple line
endings. If you use splitlines(), then it will remove the line endings, so
if there are two-char line endings, you’ll get off-by-one errors, yes?
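
For what it's worth, a quick check suggests this depends on the keepends
argument: splitlines(True), as used in the expression quoted below, keeps
each line ending attached, so the cumulative lengths stay aligned even with
two-char endings. A minimal sketch (the helper name line_offsets and the
sample data are made up for illustration):

```python
from itertools import accumulate, chain


def line_offsets(data, keepends=True):
    """Cumulative line lengths: each line start, then the total as last entry."""
    return list(accumulate(chain([0], map(len, data.splitlines(keepends)))))


mixed = b"one\r\ntwo\nthree\rfour"

kept = line_offsets(mixed)             # endings kept: offsets match the data
stripped = line_offsets(mixed, False)  # endings dropped: offsets drift

print(kept)      # [0, 5, 9, 15, 19]
print(stripped)  # [0, 3, 6, 11, 15]
```

With the endings stripped, the drift is one byte per "\n" or "\r" and two
bytes per "\r\n", which is the error I was worried about.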

I would think you could look for “\n” and get the correct answer (with
extraneous “\r”s in the substrings).
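
A rough sketch of what I mean, assuming bytes input (the function name
newline_offsets is hypothetical). Note the limitation: a lone "\r" is not
treated as a line ending here, so this is not full universal-newline
support:

```python
def newline_offsets(data):
    """Start of data plus the offset just past each LF byte (lone CR ignored)."""
    offsets = [0]
    pos = data.find(b"\n")
    while pos != -1:
        offsets.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return offsets


mixed = b"one\r\ntwo\nthree\rfour"
lf_offsets = newline_offsets(mixed)
print(lf_offsets)  # [0, 5, 9]

# The first line still carries its "\r" when sliced without the "\n".
first_line = mixed[lf_offsets[0]:lf_offsets[1] - 1]
print(first_line)  # b'one\r'
```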

-CHB

On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <python...@gmail.com>
wrote:

> If you are working with bytes, then numpy could be perfect: not a small
> dependency, of course, but it should work, and work fast.
>
> And a cython method would be quite easy to write, but of course
> substantially harder to distribute :-(
>
> -CHB
>
> On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders <jonat...@slenders.be>
> wrote:
>
>> Thanks all for all the responses! That's quite a bit to think about.
>>
>> A couple of thoughts:
>>
>> 1. First, I do support a transition to UTF-8, so I understand we don't
>> want to add more methods that deal with character offsets. (I'm familiar
>> with how strings work in Rust.) However, does that mean we won't be
>> using/exposing any offset at all, or will it become possible to slice using
>> byte offsets?
>>
>> 2. The commercial application I mentioned where this is critical is
>> actually using bytes instead of str. Sorry for not mentioning this earlier. We
>> were doing the following:
>>     list(accumulate(chain([0], map(len, text.splitlines(True)))))
>> where text is a bytes object. This is significantly faster than a binary
>> regex for finding all universal line endings. This application is an
>> asyncio web app that streams Cisco show-tech files (often several
>> gigabytes) from a file server over HTTP; stores them chunk by chunk into a
>> local cache file on disk; and builds an index of byte offsets in the
>> meantime by running the above expression over every chunk. That way the
>> client web app can quickly load the lines from disk as the user scrolls
>> through the file. A very niche application indeed, so use of Cython would
>> be acceptable in this particular case. I published the relevant snippet
>> here to be studied:
>> https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
>> It does handle an interesting edge case regarding UTF-16.
>>
>> 3. The code in prompt_toolkit can be found here:
>> https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
>> (It's not yet using 'accumulate' there, but for the rest it's the same.)
>> Also here, universal line endings support is important, because the editing
>> buffer can in theory contain a mix of line endings. It has to be
>> performant, because it executes on every keystroke. A more complex data
>> structure could probably solve the performance issues here, but it's
>> really not worth the complexity that it would introduce in every text
>> manipulation (like every key binding). Also, the "re" library can't
>> search over a list of lines or anything else that's not a simple string.
>>
>> 4. I tested on 3.11.0b3. Using the splitlines() approach is still 2.5
>> times faster than re. Imagine if splitlines() didn't have to do the work
>> of actually creating the substrings, but only had to return the offsets:
>> that should be faster still and not require so much memory. (I have a
>> benchmark that does it one chunk at a time, to prevent using too much
>> memory:
>> https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf
>> )
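
To make the chunk-by-chunk indexing from point 2 concrete, here is a minimal
sketch for bytes input. It is not the code from the gists; the name
index_chunks and the hold-back strategy are assumptions for illustration.
The last piece of every chunk is held back, because it may be an unfinished
line or may end in "\r" whose "\n" only arrives with the next chunk:

```python
from itertools import accumulate, chain


def index_chunks(chunks):
    """Line-start offsets over an iterable of bytes chunks, total length last."""
    offsets = [0]
    base = 0        # absolute offset where `pending` starts
    pending = b""   # held-back tail of the previous chunk
    for chunk in chunks:
        lines = (pending + chunk).splitlines(True)
        # Hold back the final piece until more data (or the end) arrives.
        pending = lines.pop() if lines else b""
        for line in lines:
            base += len(line)
            offsets.append(base)
    if pending:
        offsets.append(base + len(pending))
    return offsets


data = b"one\r\ntwo\nthree\rfour"
# Deliberately split between the "\r" and "\n" of the first line ending.
chunked = index_chunks([data[:4], data[4:11], data[11:]])
whole = list(accumulate(chain([0], map(len, data.splitlines(True)))))
print(chunked == whole)  # True
```

The copy in `pending + chunk` keeps the sketch short; with very long lines
and small chunks it would be worth avoiding.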
>>
>> So talking about bytes. Would it be acceptable to have a
>> `bytes.line_offsets()` method instead? Or
>> `bytes.splitlines(return_offsets=True)`? Because byte offsets are okay, or
>> not? `str.splitlines(return_offsets=True)` would be very nice, but I
>> understand the concerns.
>>
>> It's somewhat frustrating here knowing that for `splitlines()`, the
>> information is there, already computed, just not immediately accessible
>> (without having Python do lots of unnecessary work).
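
Neither `bytes.line_offsets()` nor a `return_offsets` flag exists today. As
a way to pin down what such a method might return, here is a pure-Python
sketch of the two approaches compared in this thread, a binary regex and
the splitlines() expression. The function names are hypothetical, and the
convention that the last entry equals the total length is taken from the
accumulate expression above:

```python
import re
from itertools import accumulate, chain

# "\r\n" must come first so a CRLF pair counts as one line ending.
_LINE_END = re.compile(rb"\r\n|\r|\n")


def offsets_via_re(data):
    """Offsets from a binary regex: 0, the byte after each ending, then len."""
    offsets = [0] + [m.end() for m in _LINE_END.finditer(data)]
    if offsets[-1] != len(data):
        offsets.append(len(data))
    return offsets


def offsets_via_splitlines(data):
    """Offsets from cumulative lengths of lines with their endings kept."""
    return list(accumulate(chain([0], map(len, data.splitlines(True)))))


sample = b"one\r\ntwo\nthree\rfour"
print(offsets_via_re(sample) == offsets_via_splitlines(sample))  # True
```

Both build the same list; the splitlines() version just creates every
substring along the way, which is the work a dedicated method could skip.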
>>
>> Jonathan
>>
>>
>> On Sun, Jun 19, 2022 at 3:34 PM Jonathan Fine <jfine2...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> This is a nice problem, well presented. Here are four comments / questions.
>>>
>>> 1. How does the introduction of faster CPython in Python 3.11 affect the
>>> benchmarks?
>>> 2. Is there an across-the-board change that would speed up this
>>> line-offsets task?
>>> 3. To limit splitlines memory use (at small performance cost), chunk the
>>> input string into, say, 4 KB blocks.
>>> 4. Perhaps anything done here for strings should also be done for bytes.
>>>
>>> --
>>> Jonathan
>>> _______________________________________________
>>> Python-ideas mailing list -- python-ideas@python.org
>>> To unsubscribe send an email to python-ideas-le...@python.org
>>> https://mail.python.org/mailman3/lists/python-ideas.python.org/
>>> Message archived at
>>> https://mail.python.org/archives/list/python-ideas@python.org/message/AETGT5HDF3QOFODOWKB4X45ZE4CZ7Y3M/
>>> Code of Conduct: http://python.org/psf/codeofconduct/
>>>
>> Message archived at
>> https://mail.python.org/archives/list/python-ideas@python.org/message/FZ7V4FFKR45YLQDHTD2JZYEWZ5HEI3P2/
>>
> --
> Christopher Barker, PhD (Chris)
>
> Python Language Consulting
>   - Teaching
>   - Scientific Software Development
>   - Desktop GUI and Web Development
>   - wxPython, numpy, scipy, Cython
>
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/MZHVUDCRNQ7AI6MPFCQAVIJLCRVUKMJ4/
