[Python-ideas] Re: Python 4000: Have stringlike objects provide sequence views rather than being sequences

Steven D'Aprano Sat, 26 Oct 2019 09:43:48 -0700

On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:

> Nothing good can come of decomposing strings into Unicode code points.

Sure there is. In Python, it's the fastest way to calculate the digit
sum of an integer. It's also useful for implementing classical
encryption algorithms, like Playfair.
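
For concreteness, here is a minimal sketch of the digit-sum idiom: it walks the code points of the integer's decimal string. The name digit_sum is made up for this illustration, and the abs() call is an assumption about how negative numbers should be treated.

```python
def digit_sum(n):
    """Sum the decimal digits of an integer via its string form."""
    # abs() drops the sign so that '-' is never passed to int().
    return sum(int(c) for c in str(abs(n)))


print(digit_sum(2019))  # adds 2 + 0 + 1 + 9
```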

Introspection, e.g. if I want to know if a string contains any
surrogates, I can do this:

any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral
characters", in which case it isn't safe to pass to a JavaScript or
Tcl script which doesn't handle them correctly:

any(c > '\uFFFF' for c in s)
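
Wrapped up as functions, the two checks might look like the sketch below. The names has_surrogates and has_astral are hypothetical helpers, not anything in the standard library.

```python
def has_surrogates(s):
    """True if s contains a code point in the surrogate range U+D800..U+DFFF."""
    return any('\ud800' <= c <= '\udfff' for c in s)


def has_astral(s):
    """True if s contains a code point outside the Basic Multilingual Plane."""
    return any(c > '\uffff' for c in s)


sample = "caf\u00e9 \U0001F600"  # ends with an emoji above U+FFFF
print(has_surrogates(sample), has_astral(sample))
```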

How about education? One of the things I can do with strings is:

for c in string:
    print(unicodedata.name(c))

or possibly even just

# what is that weird symbol in position five?
print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and
find out what it means. Knowing stuff is good, right?
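
One caveat with the loop above: unicodedata.name() raises ValueError for code points that have no name (most control characters, for instance) unless a default is given. A slightly more defensive sketch follows; the helper name describe and the "<unnamed>" placeholder are assumptions made for this example.

```python
import unicodedata


def describe(s):
    """Return (index, 'U+XXXX', name) for each code point in s."""
    rows = []
    for i, c in enumerate(s):
        # The second argument is returned instead of raising ValueError
        # when the code point has no name.
        name = unicodedata.name(c, "<unnamed>")
        rows.append((i, f"U+{ord(c):04X}", name))
    return rows


for row in describe("a\u00e9\n"):
    print(row)
```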

Or do you think the world would be better off if it was really hard
and "ugly" (your word) for people like me to find out what code points
are called and what their meaning is?

Rather than just telling us that we shouldn't be allowed to access code
points in strings, would you please be explicit about *why* this access
is a bad thing?

And if code points are "bad", then what should we be allowed to do with
strings? If code points are too low-level, then what is an appropriate
level?

I guess you're probably going to mention grapheme clusters. (If you
aren't, then I have no idea what your objection is based on.)

Grapheme clusters are a hard problem to solve, since they are dependent
on the language and the locale. There's a Unicode algorithm for
splitting on graphemes, but it ignores the locale differences.

Processing on graphemes is more expensive than on code points. There is,
as far as I can tell, no O(1) access to graphemes in a string without
pre-processing them and keeping a list of their indices.
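
To illustrate the pre-processing cost, here is a rough sketch that builds the list of indices in one O(n) pass and then gives O(1) lookups. Note the loud assumption: it treats "a base character plus any following combining marks" as a cluster, using unicodedata.combining(). That is only a crude stand-in for the real Unicode segmentation algorithm, and it does not handle things like emoji sequences or variation selectors. The names cluster_starts and cluster_at are invented for the example.

```python
import unicodedata


def cluster_starts(s):
    """Indices where an approximate cluster begins (one O(n) pass)."""
    starts = []
    for i, c in enumerate(s):
        # A code point with combining class 0 starts a new cluster;
        # index 0 always does, even if it happens to be a combining mark.
        if i == 0 or unicodedata.combining(c) == 0:
            starts.append(i)
    return starts


def cluster_at(s, starts, k):
    """Return the k-th approximate cluster using the precomputed index list."""
    end = starts[k + 1] if k + 1 < len(starts) else len(s)
    return s[starts[k]:end]


text = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
starts = cluster_starts(text)
print(starts, len(text))
```

The extra list is exactly the memory overhead in question: one integer per cluster, kept alongside the string.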

For many people, and for many purposes, paying that extra cost in either
time or memory is just a total waste, since they're hardly ever going to
come across a grapheme cluster. Few people have to process completely
arbitrary strings: their data tends to come from a particular subset of
natural language strings, and for some such languages, you might go a
whole lifetime without coming across a grapheme cluster of more than one
code point.

(This may be slowly changing, even for American English, driven in part
by the use of emoji and variation selectors.)
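
A small illustration of what that looks like at the code point level. The particular sample strings are my own choice for the example.

```python
import unicodedata

heart = "\u2764\ufe0f"   # HEAVY BLACK HEART + VARIATION SELECTOR-16
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# Each of the first two is a single perceived character but two code points.
print(len(heart), len(decomposed), len(composed))
```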

If Python came with a grapheme processing API, I would probably use it.
But in the meantime, the code point API is "good enough" for most things
I do with strings. And for the rest, graphemes are too low-level: I need
things like sentences, clauses, words, word stems, prefixes and
suffixes, syllables etc.

But even if Python had an excellent, fast grapheme API, I would still
want a nice, clean, fast interface that operates on code-points.

--
Steven
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/OCG64OW4WPVDFUSN3R7AGI6M4NFKGJIP/
