On 2022-06-20 16:12, Christopher Barker wrote:
Hmm - I’m a bit confused about how you handle mixed / multiple line
endings. If you use splitlines(), then it will remove the line endings,
so if there are two-char line endings, then you’ll get off by one
errors, yes?
I would think you could look for “\n” and get the correct answer (with
extraneous “\r”s in the substrings).
-CHB
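To make the off-by-one concern concrete, here is a small sketch on a made-up bytes sample (the sample and the helper name `starts` are illustrative only, not taken from anyone's code). It compares offsets computed from pieces with and without their terminators, and the "look for \n only" idea:

```python
# Made-up sample mixing \r\n, \n and a lone \r.
text = b"one\r\ntwo\nthree\rfour"


def starts(pieces):
    """Return the running start offset of each piece."""
    offsets, pos = [], 0
    for piece in pieces:
        offsets.append(pos)
        pos += len(piece)
    return offsets


# Terminators dropped: every offset after the first line is too small.
wrong = starts(text.splitlines())        # [0, 3, 6, 11]

# Terminators kept (keepends=True): offsets match the real positions.
right = starts(text.splitlines(True))    # [0, 5, 9, 15]

# Searching for \n only: correct for \n and \r\n, but a lone \r is
# not treated as a line break, so "three\rfour" stays one line.
newline_only = [0] + [i + 1 for i, b in enumerate(text) if b == 0x0A]  # [0, 5, 9]
```

So the offsets only come out right when the line endings are kept, which is what splitlines(True) does.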
How about something like .split, but returning the spans instead of the
strings?
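As a rough pure-Python sketch of what I mean (the name `split_spans` is hypothetical; nothing like it exists in the stdlib today), for a single fixed separator:

```python
def split_spans(data, sep):
    """Return (start, end) pairs for the pieces data.split(sep) would produce.

    Works for both str and bytes, as long as data and sep have the same type.
    """
    if not sep:
        raise ValueError("empty separator")
    spans = []
    start = 0
    while True:
        end = data.find(sep, start)
        if end == -1:
            # No more separators: the last piece runs to the end.
            spans.append((start, len(data)))
            return spans
        spans.append((start, end))
        start = end + len(sep)


data = b"a,bb,,c"
spans = split_spans(data, b",")   # [(0, 1), (2, 4), (5, 5), (6, 7)]
```

Slicing data with each span gives back the same pieces as data.split(sep), but the caller can choose not to create them at all. A universal-newlines variant would need to recognise several terminators rather than one separator.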
On Mon, Jun 20, 2022 at 5:04 PM Christopher Barker <python...@gmail.com> wrote:
If you are working with bytes, then numpy could be perfect: not a
small dependency, of course, but it should work, and work fast.
And a cython method would be quite easy to write, but of course
substantially harder to distribute :-(
-CHB
On Sun, Jun 19, 2022 at 5:30 PM Jonathan Slenders
<jonat...@slenders.be> wrote:
Thanks all for all the responses! That's quite a bit to think about.
A couple of thoughts:
1. First, I do support a transition to UTF-8, so I understand we
don't want to add more methods that deal with character offsets.
(I'm familiar with how strings work in Rust.) However, does that
mean we won't be using/exposing any offset at all, or will it
become possible to slice using byte offsets?
2. The commercial application I mentioned where this is critical
is actually using bytes instead of str. Sorry for not mentioning
earlier. We were doing the following:
list(accumulate(chain([0], map(len, text.splitlines(True)))))
where text is a bytes object. This is significantly faster than
a binary regex for finding all universal line endings. This
application is an asyncio web app that streams Cisco show-tech
files (often several gigabytes) from a file server over HTTP;
stores them chunk by chunk into a local cache file on disk; and
builds an index of byte offsets in the meantime by running the
above expression over every chunk. That way the client web app
can quickly load the lines from disk as the user scrolls through
the file. A very niche application indeed, so use of Cython
would be acceptable in this particular case. I published the
relevant snippet here to be studied:
https://gist.github.com/jonathanslenders/59ddf8fe2a0954c7f1865fba3b151868
It does handle an interesting edge case regarding UTF-16.
3. The code in prompt_toolkit can be found here:
https://github.com/prompt-toolkit/python-prompt-toolkit/blob/master/src/prompt_toolkit/document.py#L209
(It's not yet using 'accumulate' there, but for the rest it's
the same.) Also here, universal line endings support is
important, because the editing buffer can in theory contain a
mix of line endings. It has to be performant, because it
executes on every keystroke. In this case, a more complex data
structure could probably solve performance issues here, but it's
really not worth the complexity that it introduces in every text
manipulation (like every key binding). Also try using the "re"
library to search over a list of lines or anything that's not a
simple string.
4. I tested on 3.11.0b3. Using the splitlines() approach is
still 2.5 times faster than re. Imagine if splitlines() didn't
have to do the work of actually creating the substrings, but only
had to return the offsets: that should be much faster still and
require far less memory. (I have a benchmark that does it
one chunk at a time, to avoid using too much memory:
https://gist.github.com/jonathanslenders/bfca8e4f318ca64e718b4085a737accf)
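For anyone who wants to try the expression from point 2 without opening the gists, here is a self-contained sketch next to a regex-based equivalent. The function names and the sample are made up for illustration, and the regex variant is only my guess at what a "binary regex for universal line endings" would look like; it is not the code from the benchmark.

```python
import re
from itertools import accumulate, chain


def line_offsets(text):
    """Offsets via splitlines(True): 0, then the end of every line.

    The last entry is always len(text).
    """
    return list(accumulate(chain([0], map(len, text.splitlines(True)))))


# bytes.splitlines() breaks on \n, \r and \r\n; the pattern mirrors that.
_LINE_END = re.compile(rb"\r\n|\r|\n")


def line_offsets_re(text):
    """Same result using a bytes regex instead of splitlines()."""
    offsets = [0]
    offsets.extend(m.end() for m in _LINE_END.finditer(text))
    if offsets[-1] != len(text):
        # Unterminated last line: add the end-of-data offset as well,
        # to match what the accumulate version returns.
        offsets.append(len(text))
    return offsets


sample = b"one\r\ntwo\nthree\rfour"
offsets = line_offsets(sample)   # [0, 5, 9, 15, 19]
```

Both functions return the same list here. The splitlines() version still creates every line as a separate bytes object just to take its length, which is the unnecessary work I'm referring to.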
So, talking about bytes: would it be acceptable to have a
`bytes.line_offsets()` method instead? Or
`bytes.splitlines(return_offsets=True)`? Byte offsets are okay,
aren't they? `str.splitlines(return_offsets=True)` would be
very nice, but I understand the concerns.
It's somewhat frustrating knowing that for `splitlines()`,
the information is there, already computed, just not immediately
accessible without having Python do lots of unnecessary work.
Jonathan
On Sun, Jun 19, 2022 at 3:34 PM Jonathan Fine <jfine2...@gmail.com> wrote:
Hi
This is a nice problem, well presented. Here are four comments
/ questions.
1. How does the introduction of faster CPython in Python
3.11 affect the benchmarks?
2. Is there an across-the-board change that would speed up
this line-offsets task?
3. To limit splitlines memory use (at a small performance
cost), chunk the input string into, say, 4 KB blocks.
4. Perhaps anything done here for strings should also be
done for bytes.
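A rough sketch of the chunking idea in point 3, for bytes. The function name is made up, and holding back a trailing "\r" is just one way to handle a "\r\n" pair that straddles a chunk boundary; other approaches are possible.

```python
def chunked_line_starts(chunks):
    """Return 0 plus the absolute offset following every line terminator.

    `chunks` is any iterable of bytes objects. Only one chunk (plus at
    most one held-back byte) is split at a time.
    """
    starts = [0]
    pos = 0          # absolute offset of the data consumed so far
    pending = b""    # a trailing b"\r" held back from the previous chunk
    for chunk in chunks:
        data = pending + chunk
        pending = b""
        if data.endswith(b"\r"):
            # The next chunk might begin with b"\n"; decide later.
            data, pending = data[:-1], b"\r"
        for piece in data.splitlines(True):
            pos += len(piece)
            if piece.endswith((b"\n", b"\r")):
                starts.append(pos)
    if pending:
        # The input ended with a lone b"\r", which terminates a line.
        pos += 1
        starts.append(pos)
    return starts


data = b"one\r\ntwo\nthree\rfour"
blocks = [data[i:i + 4] for i in range(0, len(data), 4)]  # 4 bytes, not 4 KB
result = chunked_line_starts(blocks)   # [0, 5, 9, 15]
```

The first block here ends in "\r" and the second begins with "\n", so a naive per-block splitlines() would count that pair as two line breaks. The result is the same as processing the whole input as a single chunk.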
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/O6JLWVMENV47FIKRVLPC26KV45STMY3T/
Code of Conduct: http://python.org/psf/codeofconduct/