On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:

> On Oct 13, 2019, at 12:02, Steve Jorgensen <ste...@stevej.name> wrote:
> [...]
> > This proposal is a serious breakage of backward compatibility, so
> > would be something for Python 4.x, not 3.x.
>
> I’m pretty sure almost nobody wants a 3.0-like break again, so this
> will probably never happen.
Indeed, and Guido did rule some time ago that 4.0 would be an ordinary
transition, like 3.7 to 3.8, not a big backwards-incompatible version
change. I've taken up referring to some hypothetical future 3.0-like
version as Python 5000 (not 4000), in analogy to Python 3000, to
emphasise just how far away it will be.

> And finally, if you want to break strings, it’s probably worth at
> least considering making UTF-8 strings first-class objects. They can’t
> be randomly accessed,

I don't see why you can't make arrays of UTF-8 indexable and provide
random access to any code point. I understand that ``str`` in
MicroPython is implemented that way. The obvious implementation means
that you lose O(1) indexing (to reach the N-th code point, you have to
count from the beginning each time) but save memory over other
encodings. (At worst, a code point in UTF-8 takes four bytes, the same
as UTF-32, and most take fewer.) There are ways to get back O(1)
indexing, but they cost more memory.

But why would you want an explicit UTF-8 string object? What benefit do
you get from exposing the fact that the implementation happens to be
UTF-8 rather than something else? (Not rhetorical questions.)

If the UTF-8 object operates on the basis of Unicode code points, then
it's just a str, and the implementation is just an implementation
detail. If the UTF-8 object operates on the basis of raw bytes, with no
protection against malformed UTF-8 (e.g. allowing you to insert stray
bytes in the range 0x80-0xFF that don't form a valid sequence, or to
split apart a multi-byte UTF-8 sequence), then it's just a bytes
object (or bytearray) initialised with a UTF-8 sequence.

That is, as I understand it, what languages like Go do. To paraphrase,
they offer data types they *call* UTF-8 strings, except that they can
contain arbitrary bytes and be invalid UTF-8. We can already do this,
today, without the deeply misleading name: string.encode('utf-8') and
then work with the bytes.
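To make the O(N) indexing trade-off concrete, here is a minimal sketch
of indexing a UTF-8 buffer by code point. This is only an illustration,
not MicroPython's actual implementation, and it assumes the buffer
holds well-formed UTF-8:

```python
def utf8_getitem(buf: bytes, n: int) -> str:
    """Return the n-th code point in a well-formed UTF-8 buffer.

    O(N): code points have variable width (1-4 bytes), so we must
    walk from the start; there is no way to jump straight to index n.
    """
    count = 0
    i = 0
    while i < len(buf):
        lead = buf[i]
        # The lead byte determines the length of the sequence.
        if lead < 0x80:
            width = 1      # ASCII
        elif lead < 0xE0:
            width = 2
        elif lead < 0xF0:
            width = 3
        else:
            width = 4
        if count == n:
            return buf[i:i + width].decode('utf-8')
        count += 1
        i += width
    raise IndexError(f"code point index {n} out of range")
```

An O(1) variant would typically add a side table mapping code-point
indexes to byte offsets, which is exactly the extra memory cost
mentioned above.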
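A small example of that last point: the existing str/bytes pair already
covers both views, with str counting code points and the encoded bytes
counting UTF-8 code units, and with bytes free to hold invalid UTF-8
just as a Go string can hold arbitrary bytes:

```python
s = "café"
b = s.encode('utf-8')

assert len(s) == 4   # str counts code points
assert len(b) == 5   # bytes counts UTF-8 code units; 'é' is two bytes

# A bytes object happily holds invalid UTF-8:
broken = b[:-1]      # chop the 'é' sequence in half
try:
    broken.decode('utf-8')
except UnicodeDecodeError:
    pass             # decoding back to str is where the damage shows up
else:
    raise AssertionError("expected UnicodeDecodeError")
```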
I think this is even quite efficient in CPython's "flexible string
representation". For ASCII-only strings, the UTF-8 encoding uses the
same storage as the original ASCII bytes. For other strings, the UTF-8
representation is cached for later use.

So I don't see any advantage to this UTF-8 object. If the API works on
code points, then it's just an implementation detail of str; if the API
works on code units, that's just a fancy name for bytes. We already
have both str and bytes, so what is the purpose of this utf8 object?

-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/RKY73YB2UVJMZ2PNIYJ74AFVKUAIK45K/
Code of Conduct: http://python.org/psf/codeofconduct/