Re: Grapheme clusters, a.k.a. real characters

Marko Rauhamaa Fri, 14 Jul 2017 06:38:29 -0700

Steve D'Aprano <[email protected]>:

> These are only a *few* of the *easy* questions that need to be
> answered before we can even consider your question:
>
>> So the question is, should we have a third type for text. Or should
>> the semantics of strings be changed to be based on characters?


Sure, but if they can't be answered, what good is there in having
strings (as opposed to bytes)? What problem do strings solve? What
operation depends on (or is made simpler by) having strings instead of
bytes?

We are not even talking about some exotic languages; the problem is
right there in the middle of Latin-1. We can't even say what

    len("è")

should return. And we may experience:

    >>> ord("è")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: ord() expected a character, but string of length 2 found
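
A minimal sketch of how that can happen, assuming Python 3 and the
standard unicodedata module. The two spellings of "è" below are the
precomposed and decomposed forms; safe_ord is a hypothetical helper
introduced only for illustration:

```python
import unicodedata

# Two spellings of the same visible character "è".
composed = unicodedata.normalize("NFC", "e\u0300")   # one code point, U+00E8
decomposed = unicodedata.normalize("NFD", "\u00e8")  # "e" + U+0300 combining grave


def safe_ord(text):
    """Return ord(text), or None when text is not exactly one code point."""
    try:
        return ord(text)
    except TypeError:
        return None


print(len(composed), len(decomposed))             # 1 2
print(composed == decomposed)                     # False
print(safe_ord(composed), safe_ord(decomposed))   # 232 None
```

Both strings render identically, yet len(), == and ord() all tell them
apart unless the text is normalized first.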

Of course, UTF-8 in a bytes object doesn't make the situation any
better, but does it make it any worse?

As it stands, we have

   è --[encode>-- Unicode --[reencode>-- UTF-8

Why is one encoding format better than the other?
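
For comparison, the second step of that chain can be sketched as
follows, again assuming Python 3 and the standard unicodedata module
(utf8_len is a hypothetical helper name). It only shows that the same
ambiguity carries over into the bytes; it does not settle which
representation is better:

```python
import unicodedata


def utf8_len(text):
    """Number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


composed = "\u00e8"                                  # precomposed "è"
decomposed = unicodedata.normalize("NFD", composed)  # "e" + combining grave

print(composed.encode("utf-8"))     # b'\xc3\xa8'
print(decomposed.encode("utf-8"))   # b'e\xcc\x80'
print(utf8_len(composed), utf8_len(decomposed))   # 2 3
```

Neither the code point count (1 or 2) nor the byte count (2 or 3) is
stable across the two spellings of the same character.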


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
