On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas wrote:
> On Oct 13, 2019, at 12:02, Steve Jorgensen <ste...@stevej.name> wrote:
[...]
> > This proposal is a serious breakage of backward compatibility, so 
> > would be something for Python 4.x, not 3.x.
> 
> I’m pretty sure almost nobody wants a 3.0-like break again, so this 
> will probably never happen.

Indeed, and Guido did rule some time ago that 4.0 would be an ordinary 
transition, like 3.7 to 3.8, not a big backwards-breaking version 
change.

I've taken up referring to some hypothetical future 3.0-like version as 
Python 5000 (not 4000), in analogy to Python 3000, to emphasise just 
how far away it will be.


> And finally, if you want to break strings, it’s probably worth at 
> least considering making UTF-8 strings first-class objects. They can’t 
> be randomly accessed, 

I don't see why you can't make arrays of UTF-8 indexable and provide 
random access to any code point. I understand that ``str`` in 
MicroPython is implemented that way.

The obvious implementation means that you lose O(1) indexing (to reach 
the N-th code point, you have to count from the beginning each time) but 
save memory over other encodings. (A code point in UTF-8 takes between 
one and four bytes, compared to two or four in UTF-16 and a fixed four 
in UTF-32.) There are ways to get back O(1) indexing, but they cost 
more memory.
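To make the O(N) indexing concrete, here is a minimal sketch of counting 
code points by walking UTF-8 lead bytes. (This is my own illustration, 
not MicroPython's actual implementation, and ``utf8_char_at`` is a 
made-up helper name.)

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the index-th code point of well-formed UTF-8 `data`.

    O(index) time: we must walk past every earlier code point,
    skipping 1-4 bytes per character depending on its lead byte.
    """
    def seq_len(lead: int) -> int:
        if lead < 0x80:      # 0xxxxxxx: 1-byte sequence (ASCII)
            return 1
        elif lead < 0xE0:    # 110xxxxx: 2-byte sequence
            return 2
        elif lead < 0xF0:    # 1110xxxx: 3-byte sequence
            return 3
        else:                # 11110xxx: 4-byte sequence
            return 4

    pos = 0
    for _ in range(index):
        pos += seq_len(data[pos])
    return data[pos:pos + seq_len(data[pos])].decode('utf-8')

b = "héllo🙂".encode('utf-8')
print(utf8_char_at(b, 1))   # the 2-byte character 'é'
print(utf8_char_at(b, 5))   # the 4-byte character '🙂'
```

Caching the byte offset of the last access (or keeping a sparse index of 
offsets) is one of the ways to buy back near-O(1) indexing at the cost 
of extra memory.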

But why would you want an explicit UTF-8 string object? What benefit 
do you get from exposing the fact that the implementation happens to be 
UTF-8 rather than something else? (Not rhetorical questions.)

If the UTF-8 object operates on the basis of Unicode code points, then 
it's just a str, and the implementation is just an implementation detail.

If the UTF-8 object operates on the basis of raw bytes, with no 
protection against malformed UTF-8 (e.g. allowing you to insert a stray 
continuation byte in the range 0x80-0xBF, or one of the bytes 0xC0-0xC1 
or 0xF5-0xFF which never appear in valid UTF-8, or to split apart a 
two-, three- or four-byte UTF-8 sequence) then it's just a bytes object 
(or bytearray) initialised with a UTF-8 sequence.
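A quick illustration of that lack of protection, using a plain bytes 
object: slicing at an arbitrary byte boundary can cut a multi-byte 
sequence in half, leaving malformed UTF-8.

```python
b = "café".encode('utf-8')   # 'é' encodes as two bytes: 0xC3 0xA9
print(len("café"), len(b))   # 4 code points, but 5 bytes

chopped = b[:4]              # slices through the middle of 'é'
try:
    chopped.decode('utf-8')
except UnicodeDecodeError:
    print("malformed UTF-8")  # the lead byte 0xC3 is invalid on its own
```

A hypothetical "UTF-8 string" type that permits this is offering byte 
semantics, not string semantics.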

That is, as I understand it, what languages like Go do. To paraphrase: 
they offer data types that they *call* UTF-8 strings, except that those 
strings can contain arbitrary bytes and be invalid UTF-8. We can already 
do this, today, without the deeply misleading name:

    string.encode('utf-8')

and then work with the bytes. I think this is even quite efficient in 
CPython's "flexible string representation" (PEP 393). For ASCII-only 
strings, the UTF-8 encoding uses the same storage as the original ASCII 
data. For others, the UTF-8 representation is computed on first use and 
cached.
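For concreteness, a small sketch of what "work with the bytes" looks 
like today, using only the standard str/bytes APIs:

```python
s = "naïve"
b = s.encode('utf-8')

print(len(s))   # 5 code points
print(len(b))   # 6 bytes: 'ï' encodes as 0xC3 0xAF

# byte-oriented operations work directly on the encoded form
print(b.count(b'a'))        # 1
print(b.startswith(b'na'))  # True

# for ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes
print("hello".encode('utf-8') == b"hello")  # True

# decoding recovers the original str
print(b.decode('utf-8') == s)  # True
```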

So I don't see any advantage to this UTF-8 object. If the API works on
code points, then it's just an implementation detail of str; if the API 
works on code units, that's just a fancy name for bytes. We already have 
both str and bytes so what is the purpose of this utf8 object?


-- 
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/RKY73YB2UVJMZ2PNIYJ74AFVKUAIK45K/
Code of Conduct: http://python.org/psf/codeofconduct/
