On Wed, May 25, 2022 at 06:16:50PM +0900, Stephen J. Turnbull wrote:
> mguin...@gmail.com writes:
>
> > There should be a safer abstraction to these two basic functions.
>
> There is: TextIOBase.read, then treat it as an array of code units
> (NOT CHARACTERS!!)
No need to shout :-)

Reading the full thread on the bug tracker, I think that when Marcel
(mguinhos) refers to "characters", he probably is thinking of "code
points" (not code units, as you put it).

Digression into the confusing Unicode terminology, for the benefit of
those who are confused... (which also includes me... I'm writing this
out so I can get it clear in my own mind).

A *code point* is an integer between 0 and 0x10FFFF inclusive, each of
which represents a Unicode entity. In common language, we call those
entities "characters", although they don't perfectly map to characters
in natural language. Most code points are as yet unused, most of the
rest represent natural language characters, some represent fragments of
characters, and some are explicitly designated "non-characters". (Even
the Unicode consortium occasionally calls these abstract entities
characters, so let's not get too uptight about mislabelling them.)

Abstract code points 0...0x10FFFF are all very well and good, but they
have to be stored in memory somehow, and that's where *code units* come
into it: a *code unit* is a chunk of memory, usually 8 bits, 16 bits,
or 32 bits.

https://unicode.org/glossary/#code_unit

The number of code units used to represent each code point depends on
the encoding used:

* UCS-2 is a fixed size encoding, where 1 x 16-bit code unit represents
  a code point between 0 and 0xFFFF.

* UTF-16 is a variable size encoding, where 1 or 2 x 16-bit code units
  represent a code point between 0 and 0x10FFFF.

* UCS-4 and UTF-32 are (identical) fixed size encodings, where 1 x
  32-bit code unit represents each code point.

* UTF-8 is a variable size encoding, where 1, 2, 3 or 4 x 8-bit code
  units represent each code point.

* UTF-7 is a variable size encoding which uses 1-8 7-bit code units.
  Let's not talk about that one.

(There is a short interactive sketch of the code point/code unit
distinction further down, after the quoted paragraphs.)

That's Unicode. But TextIOBase doesn't just support Unicode, it also
supports legacy encodings which don't define code points or code units.
Nevertheless we can abuse the terminology and pretend that they do,
e.g. most such legacy encodings use a fixed 1 x 8-bit code unit (a
byte) to represent a code point (a character). Some are variable size,
e.g. SHIFT-JIS. So with this mild abuse of terminology, we can pretend
that all(?) those old legacy encodings are "Unicode".

TL;DR: Every character, or non-character, or bit of a character, which
for the sake of brevity I will just call "character", is represented by
an abstract numeric value between 0 and 0x10FFFF (the code point),
which in turn is implemented by a chunk of memory between 1 and N bytes
in size, for some value of N that depends on the encoding.

> One thing you don't seem to understand: Python does *not* know about
> characters natively. str is an array of *code units*.

Code points, not units.

Except that even the Unicode Consortium sometimes calls them
"characters" in plain English. E.g. the code point U+0041, which has
numeric value 0x41 or 65 in decimal, represents the character "A".

(Other code points do not represent natural language characters, but if
ASCII can call control characters like NULL and BEL "characters", we
can do the same for code points like U+FDD0, official Unicode
terminology be damned.)

> This is much
> better than the pre-PEP-393 situation (where the unicode type was
> UTF-16, nowadays except for PEP 383 non-decodable bytes there are no
> surrogates to worry about),

Narrow builds were UCS-2; wide builds were UTF-32.
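To make the code point / code unit distinction concrete, here is a
rough sketch at the Python 3 (post-PEP-393) interactive prompt. I've
picked U+1F600, an emoji outside the BMP, purely as an example, and the
little-endian codecs so the byte counts aren't muddied by a BOM:

    >>> s = "\U0001F600"             # one code point, outside the BMP
    >>> len(s)                       # Python 3 counts code points
    1
    >>> len(s.encode("utf-8"))       # 4 x 8-bit code units
    4
    >>> len(s.encode("utf-16-le"))   # 2 x 16-bit code units (a surrogate pair)
    4
    >>> len(s.encode("utf-32-le"))   # 1 x 32-bit code unit
    4
    >>> ord("A"), hex(ord("A"))      # U+0041 is the character "A"
    (65, '0x41')
    >>> "A\u3042".encode("shift_jis")  # legacy variable-width: 1 byte, then 2
    b'A\x82\xa0'

In every case the str object itself only sees the code points; the code
units only appear once you encode to bytes.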
The situation was complicated in that your terminal was probably
UTF-16, and so a surrogate pair that Python saw as two code points may
have been displayed by the terminal as a single character.

> but Python doesn't care if you use NFD,

The *normalisation forms* NFD etc operate at the level of code points,
not encodings.

I believe you may be trying to distinguish between what Unicode calls
"graphemes", which is very nearly the same as natural language
characters (plus control characters, noncharacters, etc), versus plain
old code points. For example, the grapheme (natural character) ü may
be normalised as the single code point

    U+00FC LATIN SMALL LETTER U WITH DIAERESIS

or as a sequence of code points:

    U+0075 LATIN SMALL LETTER U
    U+0308 COMBINING DIAERESIS

(there is a short interactive sketch of the two forms in the P.S.
below).

I believe that dealing with graphemes is a red herring, and that is not
what Marcel has in mind.

-- 
Steve (the other one)
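P.S. For the avoidance of doubt, here is a rough sketch of those two
normalisation forms of ü at the interactive prompt, using nothing but
the stdlib unicodedata module:

    >>> import unicodedata
    >>> composed = "\u00fc"                 # U+00FC, a single code point
    >>> decomposed = unicodedata.normalize("NFD", composed)
    >>> [hex(ord(c)) for c in decomposed]   # U+0075 + U+0308, two code points
    ['0x75', '0x308']
    >>> len(composed), len(decomposed)
    (1, 2)
    >>> composed == decomposed              # same grapheme, different code points
    False
    >>> unicodedata.normalize("NFC", decomposed) == composed
    True

Same grapheme either way; the difference is only visible at the code
point level.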