On Sat, Oct 26, 2019 at 11:34:34PM -0400, David Mertz wrote:

> What does actual CPython do currently to find that s[1_000_000], assuming
> utf-8 internal representation?

CPython doesn't use a UTF-8 internal representation. MicroPython *may*,
but I don't know if they do anything fancy to avoid O(N) indexing.
IronPython and Jython use whatever .NET and Java use.

CPython uses a custom implementation, the Flexible String Representation
(PEP 393), which picks the smallest code unit size required to store all
the characters in the string.

    # Pseudo-code
    c = max(string)  # Highest code point
    if c <= '\xFF':
        # effectively ASCII or Latin-1
        use one byte per code point
    elif c <= '\uFFFF':
        # effectively UCS-2, or UTF-16 without the surrogate pairs
        use two bytes per code point
    else:
        assert c <= '\U0010FFFF'
        # effectively UCS-4, or UTF-32
        use four bytes per code point

Because every code point in a given string has the same width,
s[1_000_000] is a plain array lookup, O(1) rather than O(N).

-- 
Steven
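
P.S. For anyone who wants to poke at this, here is a rough runnable
sketch of the selection rule described above. The names code_unit_width
and measured_width are mine, not anything in CPython, and measured_width
leans on sys.getsizeof, whose results are a CPython implementation
detail, so treat that half as an assumption rather than a guarantee.

```python
import sys


def code_unit_width(s: str) -> int:
    """Bytes per code point that the rule above would pick for s.

    Illustrative helper only; CPython makes this choice in C when the
    string is created.
    """
    if not s:
        return 1  # nothing to store, so the narrowest width suffices
    c = max(s)  # highest code point in the string
    if c <= '\xff':
        return 1  # ASCII or Latin-1 range
    elif c <= '\uffff':
        return 2  # fits in the Basic Multilingual Plane (UCS-2)
    else:
        return 4  # astral code points, up to U+10FFFF (UCS-4)


def measured_width(ch: str) -> int:
    """Estimate bytes per code point from object sizes.

    Assumption: sys.getsizeof grows linearly with length for a string
    made of one repeated character, which is CPython-specific.
    """
    small = sys.getsizeof(ch * 1)
    large = sys.getsizeof(ch * 1001)
    return (large - small) // 1000


if __name__ == "__main__":
    for ch in ("a", "\xe9", "\u20ac", "\U0001F600"):
        print(repr(ch), code_unit_width(ch), measured_width(ch))
```

On CPython 3.3 or later I would expect the two columns to agree for
each character; on other implementations the second column may well
differ, since they are free to store strings however they like.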