On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:

> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?

CPython doesn't use a UTF-8 internal representation.

MicroPython *may*, but I don't know if they do anything fancy to avoid 
O(N) indexing.

IronPython and Jython use whatever .NET and Java use.

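CPython uses a custom implementation, the Flexible String 
Representation (PEP 393, since Python 3.3), which picks the smallest 
code unit size required to store all the characters in the string. 
Every code point in a given string then takes the same number of 
bytes, so s[1_000_000] is a constant-time offset calculation rather 
than a scan.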


    # Pseudo-code
    c = max(string)  # Highest code point
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point
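
For anyone who wants to poke at this, here is a rough runnable sketch of 
the same selection rule. The function names code_unit_width and 
measured_width are mine, not anything in CPython; the real logic lives in 
C. The second helper relies on sys.getsizeof and is only meaningful on 
CPython, where it should agree with the first.

```python
import sys


def code_unit_width(s: str) -> int:
    """Bytes per code point that the rule above would pick: 1, 2 or 4.

    Hypothetical helper for illustration only.
    """
    if not s:
        return 1  # nothing to store; treat as the narrowest kind
    highest = ord(max(s))  # highest code point in the string
    if highest <= 0xFF:
        return 1  # ASCII or Latin-1 range
    if highest <= 0xFFFF:
        return 2  # fits in UCS-2
    return 4  # needs UCS-4 / UTF-32


def measured_width(ch: str, n: int = 1000) -> int:
    """CPython-specific estimate of storage per character.

    Compares the sizes of two strings built from the same character, so
    the fixed object header cancels out and only the payload remains.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n
```

For example, code_unit_width("naïve") gives 1, while adding a single 
character outside the Basic Multilingual Plane pushes the whole string 
to 4 bytes per code point.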


-- 
Steven
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/5ALOHG346WTZ5OFIJPISTZCZR6KDPZQF/
