On Wed, May 25, 2022 at 06:16:50PM +0900, Stephen J. Turnbull wrote:
> mguin...@gmail.com writes:
> 
>  > There should be a safer abstraction to these two basic functions.
> 
> There is: TextIOBase.read, then treat it as an array of code units
> (NOT CHARACTERS!!)

No need to shout :-)

Reading the full thread on the bug tracker, I think that when Marcel 
(mguinhos) refers to "characters", he probably is thinking of "code 
points" (not code units, as you put it).

Digression into the confusing Unicode terminology, for the benefit of 
those who are confused... (which also includes me... I'm writing this 
out so I can get it clear in my own mind).

A *code point* is an integer between 0 and 0x10FFFF inclusive; each 
code point represents a Unicode entity.

In common language, we call those entities "characters", although they 
don't perfectly map to characters in natural language. Most code points 
are as yet unused, most of the rest represent natural language 
characters, some represent fragments of characters, and some are 
explicitly designated "non-characters".

(Even the Unicode consortium occasionally calls these abstract entities 
characters, so let's not get too uptight about mislabelling them.)

Abstract code points 0...0x10FFFF are all very well and good, but they 
have to be stored in memory somehow, and that's where *code units* come 
into it: a *code unit* is a chunk of memory, usually 8 bits, 16 bits, or 
32 bits.

https://unicode.org/glossary/#code_unit

The number of code units used to represent each code point depends on 
the encoding used:

* UCS-2 is a fixed size encoding, where 1 x 16-bit code unit represents 
  a code point between 0 and 0xFFFF.

* UTF-16 is a variable size encoding, where 1 or 2 x 16-bit code units 
  represent a code point between 0 and 0x10FFFF.

* UCS-4 and UTF-32 are (identical) fixed size encodings, where 1 x 
  32-bit code unit represents each code point.

* UTF-8 is a variable size encoding, where 1, 2, 3 or 4 x 8-bit code 
  units represent each code point.

* UTF-7 is a variable size encoding which uses 1-8 7-bit code units. 
  Let's not talk about that one.
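Since Python's str type is a sequence of code points, the code unit 
counts above are easy to check: encode a sample string and divide the 
byte length by the code unit size. A small sketch (the sample string is 
my own choice; "A" is ASCII, "é" takes 2 bytes in UTF-8, "猫" is a BMP 
CJK character, and "🐍" at U+1F40D lies outside the BMP):

```python
# Count the code units each encoding uses for a 4-code-point string.
s = "Aé猫🐍"

for codec, unit_bits in [("utf-8", 8), ("utf-16-le", 16), ("utf-32-le", 32)]:
    data = s.encode(codec)
    units = len(data) // (unit_bits // 8)  # bytes per code unit
    print(f"{codec}: {units} code units for {len(s)} code points")
```

UTF-8 needs 10 code units, UTF-16 needs 5 (the snake is a surrogate 
pair), and UTF-32 needs exactly 4, one per code point.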

That's Unicode. But TextIOBase doesn't just support Unicode, it also 
supports legacy encodings which don't define code points or code units. 

Nevertheless we can abuse the terminology and pretend that they do, e.g. 
most such legacy encodings use a fixed 1 x 8-bit code unit (a byte) to 
represent a code point (a character). Some are variable size, e.g. 
SHIFT-JIS. So with this mild abuse of terminology, we can pretend that 
all(?) those old legacy encodings are "Unicode".
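Shift JIS is easy to poke at from Python, since the codec ships with 
the standard library: ASCII characters come out as one byte, while most 
kanji come out as two, so it behaves like a variable-width "code unit" 
scheme even though it predates the terminology. For example:

```python
# Shift JIS is variable width: ASCII maps to 1 byte,
# JIS X 0208 characters (e.g. the kanji for "cat") map to 2 bytes.
for ch in "a猫":
    print(ch, "->", len(ch.encode("shift_jis")), "byte(s)")
```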

TL;DR:

Every character, or non-character, or bit of a character, which for the 
sake of brevity I will just call "character", is represented by an 
abstract numeric value between 0 and 0x10FFFF (the code point), which in 
turn is implemented by a chunk of memory between 1 and N bytes in size, 
for some value of N that depends on the encoding.


> One thing you don't seem to understand: Python does *not* know about
> characters natively.  str is an array of *code units*.

Code points, not units.

Except that even the Unicode Consortium sometimes calls them 
"characters" in plain English. E.g. the code point U+0041, which has 
numeric value 0x41 (65 in decimal), represents the character "A".

(Other code points do not represent natural language characters, but if 
ASCII can call control characters like NULL and BEL "characters", we can 
do the same for code points like U+FDD0, official Unicode terminology be 
damned.)
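In Python the correspondence between a character and its code point is 
just ord() and chr():

```python
# A code point is just an integer; ord() and chr() convert
# between a one-character string and that integer.
print(ord("A"))             # 65, i.e. 0x41
print(chr(0x41))            # A
print(f"U+{ord('A'):04X}")  # U+0041
```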


> This is much
> better than the pre-PEP-393 situation (where the unicode type was
> UTF-16, nowadays except for PEP 383 non-decodable bytes there are no
> surrogates to worry about), 

Narrow builds were UCS-2; wide builds were UTF-32.

The situation was complicated in that your terminal was probably UTF-16, 
and so a surrogate pair that Python saw as two code points may have been 
displayed by the terminal as a single character.


> but Python doesn't care if you use NFD,

The *normalisation forms* NFD etc operate at the level of code points, 
not encodings.

I believe you may be trying to distinguish between what Unicode calls 
"graphemes", which is very nearly the same as natural language 
characters (plus control characters, noncharacters, etc), versus plain 
old code points.

For example, the grapheme (natural character) ü may be normalised as the 
single code point

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS
 
or as a sequence of code points:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS

I believe that dealing with graphemes is a red herring, and that is not 
what Marcel has in mind.
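The two normalisation forms are easy to compare in Python with 
unicodedata.normalize:

```python
import unicodedata

single = "\u00fc"      # ü as one code point (the NFC form)
combined = "u\u0308"   # u + combining diaeresis (the NFD form)

print(single == combined)          # False: the code point sequences differ
print(len(single), len(combined))  # 1 2
print(unicodedata.normalize("NFC", combined) == single)  # True
print(unicodedata.normalize("NFD", single) == combined)  # True
```

Both strings display as the same grapheme, but as sequences of code 
points they have different lengths and compare unequal until normalised 
to the same form.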


-- 
Steve
(the other one)
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7IHWFC7JF5W2NGIISUQSBAW6KAQ4ZEKD/
Code of Conduct: http://python.org/psf/codeofconduct/