On Wed, May 25, 2022 at 06:16:50PM +0900, Stephen J. Turnbull wrote:
> mguin...@gmail.com writes:
> 
>  > There should be a safer abstraction to these two basic functions.
> 
> There is: TextIOBase.read, then treat it as an array of code units
> (NOT CHARACTERS!!)

No need to shout :-)

Reading the full thread on the bug tracker, I think that when Marcel 
(mguinhos) refers to "characters", he probably is thinking of "code 
points" (not code units, as you put it).

Digression into the confusing Unicode terminology, for the benefit of 
those who are confused... (which also includes me... I'm writing this 
out so I can get it clear in my own mind).

A *code point* is an integer between 0 and 0x10FFFF inclusive; each 
code point represents a Unicode entity.

In common language, we call those entities "characters", although they 
don't perfectly map to characters in natural language. Most code points 
are as yet unused, most of the rest represent natural language 
characters, some represent fragments of characters, and some are 
explicitly designated "non-characters".

(Even the Unicode consortium occasionally calls these abstract entities 
characters, so let's not get too uptight about mislabelling them.)

Abstract code points 0...0x10FFFF are all very well and good, but they 
have to be stored in memory somehow, and that's where *code units* come 
into it: a *code unit* is a chunk of memory, usually 8 bits, 16 bits, or 
32 bits.

https://unicode.org/glossary/#code_unit

The number of code units used to represent each code point depends on 
the encoding used:

* UCS-2 is a fixed size encoding, where 1 x 16-bit code unit represents 
  a code point between 0 and 0xFFFF.

* UTF-16 is a variable size encoding, where 1 or 2 x 16-bit code units 
  represent a code point between 0 and 0x10FFFF.

* UCS-4 and UTF-32 are (identical) fixed size encodings, where 1 x 
  32-bit code unit represents each code point.

* UTF-8 is a variable size encoding, where 1, 2, 3 or 4 x 8-bit code 
  units represent each code point.

* UTF-7 is a variable size encoding which uses 1-8 7-bit code units. 
  Let's not talk about that one.
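Since Python's str type is a sequence of code points, the code unit 
counts above are easy to check: encode a sample string and divide the 
byte length by the code unit size. A small sketch (the sample string is 
my own choice; "A" is ASCII, "é" takes 2 bytes in UTF-8, "猫" is a BMP 
CJK character, and "🐍" at U+1F40D lies outside the BMP):

```python
# Count the code units each encoding uses for a 4-code-point string.
s = "Aé猫🐍"

for codec, unit_bits in [("utf-8", 8), ("utf-16-le", 16), ("utf-32-le", 32)]:
    data = s.encode(codec)
    units = len(data) // (unit_bits // 8)  # bytes per code unit
    print(f"{codec}: {units} code units for {len(s)} code points")
```

UTF-8 needs 10 code units, UTF-16 needs 5 (the snake is a surrogate 
pair), and UTF-32 needs exactly 4, one per code point.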

That's Unicode. But TextIOBase doesn't just support Unicode, it also 
supports legacy encodings which don't define code points or code units. 

Nevertheless we can abuse the terminology and pretend that they do, e.g. 
most such legacy encodings use a fixed 1 x 8-bit code unit (a byte) to 
represent a code point (a character). Some are variable size, e.g. 
SHIFT-JIS. So with this mild abuse of terminology, we can pretend that 
all(?) those old legacy encodings are "Unicode".
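Shift JIS is easy to poke at from Python, since the codec ships with 
the standard library: ASCII characters come out as one byte, while most 
kanji come out as two, so it behaves like a variable-width "code unit" 
scheme even though it predates the terminology. For example:

```python
# Shift JIS is variable width: ASCII maps to 1 byte,
# JIS X 0208 characters (e.g. the kanji for "cat") map to 2 bytes.
for ch in "a猫":
    print(ch, "->", len(ch.encode("shift_jis")), "byte(s)")
```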

TL;DR:

Every character, or non-character, or bit of a character, which for the 
sake of brevity I will just call "character", is represented by an 
abstract numeric value between 0 and 0x10FFFF (the code point), which in 
turn is implemented by a chunk of memory between 1 and N bytes in size, 
for some value of N that depends on the encoding.


> One thing you don't seem to understand: Python does *not* know about
> characters natively.  str is an array of *code units*.

Code points, not units.

Except that even the Unicode Consortium sometimes calls them 
"characters" in plain English. E.g. the code point U+0041, which has 
numeric value 0x41 (65 in decimal), represents the character "A".

(Other code points do not represent natural language characters, but if 
ASCII can call control characters like NULL and BEL "characters", we can 
do the same for code points like U+FDD0, official Unicode terminology be 
damned.)
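In Python the correspondence between a character and its code point is 
just ord() and chr():

```python
# A code point is just an integer; ord() and chr() convert
# between a one-character string and that integer.
print(ord("A"))             # 65, i.e. 0x41
print(chr(0x41))            # A
print(f"U+{ord('A'):04X}")  # U+0041
```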


> This is much
> better than the pre-PEP-393 situation (where the unicode type was
> UTF-16, nowadays except for PEP 383 non-decodable bytes there are no
> surrogates to worry about), 

Narrow builds were UCS-2; wide builds were UTF-32.

The situation was complicated in that your terminal was probably UTF-16, 
and so a surrogate pair that Python saw as two code points may have been 
displayed by the terminal as a single character.


> but Python doesn't care if you use NFD,

The *normalisation forms* NFD etc operate at the level of code points, 
not encodings.

I believe you may be trying to distinguish between what Unicode calls 
"graphemes", which is very nearly the same as natural language 
characters (plus control characters, noncharacters, etc), versus plain 
old code points.

For example, the grapheme (natural character) ü may be normalised as the 
single code point

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS
 
or as a sequence of code points:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS

I believe that dealing with graphemes is a red herring, and that is not 
what Marcel has in mind.
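The two normalisation forms are easy to compare in Python with 
unicodedata.normalize:

```python
import unicodedata

single = "\u00fc"      # ü as one code point (the NFC form)
combined = "u\u0308"   # u + combining diaeresis (the NFD form)

print(single == combined)          # False: the code point sequences differ
print(len(single), len(combined))  # 1 2
print(unicodedata.normalize("NFC", combined) == single)  # True
print(unicodedata.normalize("NFD", single) == combined)  # True
```

Both strings display as the same grapheme, but as sequences of code 
points they have different lengths and compare unequal until normalised 
to the same form.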


-- 
Steve
(the other one)
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7IHWFC7JF5W2NGIISUQSBAW6KAQ4ZEKD/
Code of Conduct: http://python.org/psf/codeofconduct/