On Fri, Jul 21, 2017 at 1:20 PM, Steve D'Aprano
<steve+pyt...@pearwood.info> wrote:
> On Fri, 21 Jul 2017 04:05 am, Marko Rauhamaa wrote:
>
>> If any string code point is 1114112 or greater
>
> By definition, no Unicode code point can ever have an ordinal value
> greater than 0x10FFFF = 1114111.
>
> So I don't know what you're talking about, but it isn't Unicode. If you
> want to invent your own text standard to compete with Unicode, well, good
> luck getting people behind it. Especially since it seems to offer nothing
> but a vast increase in complexity and memory usage for no apparent benefit
> that I can see.
He's talking specifically about an in-memory representation. The term
"code point" is inaccurate; I think "code unit" is more accurate, but
normally a code unit represents *at most* one code point, and potentially
less (UTF-16 and astral chars) - this is using it to represent *more*
than one code point. So maybe it needs a different word. Nonetheless,
this is a reasonably internally-consistent way to represent textual
data. It's a three-tier system:

* The string is stored as a series of 32-bit integers, where each
integer is either a UTF-32 code unit, or 1114112+n for some value n.
* A secondary array of 64-bit integers stores either odd integers
consisting of three 21-bit numbers packed together and then the low bit
set as a flag, or even integers representing pointers.
* The heap is tertiary memory for those combined characters consisting
of more than three code points (base character plus more than two
combining characters).

And while this is extremely complicated, it does at least push most of
the complexity to the unusual cases. A pure ASCII string is still able
to be represented exactly the same way Python 3.3+ does; all it needs is
a spare patch of memory representing an empty array of integers.
(Smarter people than I may be able to store this array in zero bytes
without getting things confused. I'm not sure.) Strings with all code
points on the BMP and no combining characters are still able to be
represented as they are today, again with the empty secondary array. The
presence of a single combining character in the string does force it to
be stored 32 bits per character, so there can be a price to pay.
Similarly, the secondary array will only VERY rarely need to contain any
pointers; most combined characters consist of a base and one combining,
or a set of three characters at most. There'll be dramatic performance
costs for strings where piles of combining characters get loaded on top
of a single base, but at least they can be accurately represented.

However, there's still one major MAJOR problem. The semantics of string
handling depend on having a proper table of Unicode character types (or
at least a listing of every combining character). As the Unicode
standard is augmented, the number of combining characters can increase,
and I don't think the Consortium has pre-allocated space saying "this is
where combining characters will be placed". So what should be done if
there's an unknown code point in a string? Should an exception be
raised? Should it assume that it's a base character? Either way, you
have versioning problems, and big ones.

As such, this could be extremely useful as a tool for textual analysis,
but it should NOT, in my opinion, be the internal representation of a
string.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
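
A minimal Python sketch of the three-tier layout described above, just to
make the packing concrete. It is one possible reading of the scheme rather
than a worked-out implementation: the names (SENTINEL_BASE, pack3, encode),
the use of unicodedata.combining() to spot combining marks, and the dict
standing in for the heap are illustration-only assumptions.

import unicodedata

SENTINEL_BASE = 0x110000      # 1114112: the first value that is not a code point
MASK_21 = (1 << 21) - 1       # any Unicode code point fits in 21 bits

def pack3(cps):
    # Pack up to three 21-bit code points into one 64-bit integer,
    # with the low bit set as the "inline, not a pointer" flag.
    assert 0 < len(cps) <= 3
    value = 0
    for i, cp in enumerate(cps):
        value |= (cp & MASK_21) << (1 + 21 * i)
    return value | 1

def encode(text):
    # Return (primary, secondary, heap) for the given str.
    primary, secondary, heap = [], [], {}

    # Group each base character with the combining marks that follow it.
    # unicodedata.combining() != 0 is only a rough stand-in for "is a
    # combining character"; a real implementation would need a full
    # property table (which is exactly the versioning problem above).
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1].append(ord(ch))
        else:
            clusters.append([ord(ch)])

    for cluster in clusters:
        if len(cluster) == 1:
            primary.append(cluster[0])              # plain UTF-32 code unit
        else:
            primary.append(SENTINEL_BASE + len(secondary))  # 1114112+n
            if len(cluster) <= 3:
                secondary.append(pack3(cluster))    # odd: three packed values
            else:
                addr = 2 * len(heap)                # even: stand-in "pointer"
                heap[addr] = tuple(cluster)
                secondary.append(addr)
    return primary, secondary, heap

print(encode("cafe"))        # pure ASCII: the secondary array stays empty
print(encode("cafe\u0301"))  # 'e' + COMBINING ACUTE goes via the secondary array

Under this reading, indexing stays O(1) on the primary array, and the extra
cost is paid only by clusters that actually contain combining marks, which
matches the point above about pushing the complexity to the unusual cases.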
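
On the versioning problem: whether a given code point counts as a combining
character depends entirely on the Unicode tables the runtime was built with,
which CPython exposes through the unicodedata module. The sketch below
illustrates one of the policies mentioned above (raise on unknown code
points); the classify() helper and the decision to raise are assumptions for
illustration, not a recommendation.

import unicodedata

# The version of the Unicode tables this particular build ships with.
print(unicodedata.unidata_version)

def classify(cp):
    # One possible policy: refuse to classify code points that are
    # unassigned in this build's tables, since a later Unicode version
    # might define them as combining characters.
    ch = chr(cp)
    if unicodedata.category(ch) == "Cn":    # Cn = unassigned
        raise ValueError("U+%04X is unassigned in Unicode %s"
                         % (cp, unicodedata.unidata_version))
    # Mn/Mc/Me are the combining mark categories.
    return "combining" if unicodedata.category(ch).startswith("M") else "base"

print(classify(0x0301))   # 'combining' (COMBINING ACUTE ACCENT)
print(classify(0x0041))   # 'base' (LATIN CAPITAL LETTER A)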