On Sat, Jun 18, 2022 at 5:13 AM Jonathan Slenders <jonat...@slenders.be> wrote:

> First time, it was needed in prompt_toolkit, where I spent a crazy amount
> of time looking for the most performant solution.
> Third time is for the Rich/Textual project from Will McGugan. (See:
> https://twitter.com/willmcgugan/status/1537782771137011715 )

Would you give me a pointer to the code? I want to know its use cases.

> The fastest solution I've been using for some time does this (simplified):
> `accumulate(chain([0], map(len, text.splitlines(True))))`. The performance
> is great here, because the number of Python instructions is O(1).
> Everything is chained in C code thanks to itertools. Because of that, it
> can outperform the regex solution by a factor of ~2.5. (Regex isn't slow,
> but iterating over the results is.)
>
> The bad things about this solution are, however:
> - Very cumbersome syntax.
> - We call `splitlines()`, which internally allocates a huge number of
>   strings, only to use their lengths. That is still much more overhead
>   than a simple for-loop in C would be.

FWIW, I had proposed str.iterlines() to fix the incompatibility between
IO.readlines() and str.splitlines(). That would be much more efficient than
splitlines() because it doesn't allocate a huge number of strings at once;
it allocates one line string at a time.

https://discuss.python.org/t/changing-str-splitlines-to-match-file-readlines/174/2

Of course, it would still be slower than your line_offsets() idea, because
it still needs to allocate line strings many times.

> Performance matters here, because for these kinds of problems, the list
> of integers that gets produced is typically used as an index to quickly
> find character offsets in the original string, depending on which line is
> displayed/processed. The bisect library helps too, to quickly convert any
> index position of that string into a line number. The point is that for
> big inputs, the number of Python instructions executed is not O(n), but
> O(1). Of course, some of the C code remains O(n).
>
> So, my ask here.
> Would it make sense to add a `line_offsets()` method to `str`?
> Or even `character_offsets(character)` if we want to do that for any
> character?
> Or `indexes(...)/indices(...)` if we would allow substrings of arbitrary
> lengths?
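For reference, the quoted recipe plus the bisect lookup can already be
wrapped up in pure Python today. A minimal sketch (line_offsets() and
line_number() are just illustrative names, not an existing API):

    from bisect import bisect_right
    from itertools import accumulate, chain

    def line_offsets(text):
        # Offset of the first character of each line, plus a final
        # sentinel equal to len(text). splitlines(True) keeps the line
        # endings, so the lengths accumulate to real offsets.
        return list(accumulate(chain([0], map(len, text.splitlines(True)))))

    def line_number(offsets, index):
        # Convert a character index back into a 0-based line number.
        return bisect_right(offsets, index) - 1

    text = "one\ntwo\nthree\n"
    offsets = line_offsets(text)    # [0, 4, 8, 14]
    line_number(offsets, 9)         # 2, i.e. the line "three"

The trailing sentinel offset is what makes bisect_right() give the right
answer for indexes inside the last line.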
I don't like string offsets, so I don't like adding more methods that
return offsets.

Currently, string offsets in Python are counted in code points. This is not
efficient for Python implementations that use UTF-8 (PyPy and MicroPython)
or UTF-16 (maybe Jython and IronPython, but I don't know) internally.

CPython uses PEP 393 for now, so offsets are efficient. But I want to
change it to UTF-8 in the future. A UTF-8 internal encoding is much more
efficient for many Python use cases, like:

* Read a UTF-8 string from a text file and write it to a UTF-8 console.
* Read UTF-8 from a database and write it to UTF-8 JSON.

Additionally, there are many very fast string algorithms that work with
UTF-8, written in C or Rust. Python is a glue language; reducing overhead
with such libraries is good for Python.

For now, my recommendation is to use a library written in Cython if it is
performance critical.

Regards,

-- 
Inada Naoki <songofaca...@gmail.com>