On Tue, Jul 17, 2018 at 4:15 AM, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Mon, Jul 16, 2018 at 12:02 PM Terry Reedy <tjre...@udel.edu> wrote:
>>
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>
>> > if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.  Nor are
>> they always UCS-2 (or partly UTF-16) strings.  Nor are they always
>> Latin-1 or ASCII strings.  Python's Flexible String Representation uses
>> the narrowest possible internal code for any particular string.  This is
>> all transparent to the user except for memory size.
>>
>> In 3.2 and before, Python's Unicode strings were either wide (UTF-32) or
>> narrow (UCS-2 + surrogates or UTF-16 minus full compliance).  The
>> difference was sometimes not transparent, and code that worked on one
>> build could fail on the other.  Since 3.3, string code should work the
>> same on any machine running the same Python version.
>>
>> > UTF-32, after all, is a variable-width encoding.
>>
>> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>
> Although it only really uses 21 (actually, more like 20.087) of those
> bits. Given that and the similar naming, it's easy to see how people
> sometimes confuse its structure with UTF-8.

Yes, but that's on par with ASCII text putting seven bits' worth of
information into an eight-bit byte. UTF-32 still assigns four bytes per
codepoint, even though you could represent any Unicode character with
just 21 bits (or, as you say, a smidgen over twenty bits).

(Nobody's yet proposed a UTF-24, to my knowledge, even though it would
technically work. I suspect that either UTF-32 or UTF-8 would be
superior in any situation where UTF-24 might have been used.)

ChrisA
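
For anyone who wants to check the numbers in this thread, here is a
minimal sketch. The helper names (encoded_width, per_char_bytes) and the
sample characters are my own choices for illustration, and the memory
measurement relies on sys.getsizeof, so that part reflects CPython's
internal string storage rather than anything the language guarantees.

```python
import math
import sys

# Unicode code points run from U+0000 through U+10FFFF.
CODE_POINTS = 0x10FFFF + 1


def encoded_width(ch, encoding):
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


# One sample each from ASCII, Latin-1, the rest of the BMP, and an astral plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# "utf-32-le" is used so that no byte-order mark is added to the output.
utf32_widths = [encoded_width(ch, "utf-32-le") for ch in samples]
utf8_widths = [encoded_width(ch, "utf-8") for ch in samples]

# Bits needed to number every code point: a fractional and a whole-bit figure.
bits_needed = math.log2(CODE_POINTS)
whole_bits = (CODE_POINTS - 1).bit_length()


def per_char_bytes(ch, n=1000):
    """Approximate internal storage per character (implementation detail).

    Subtracting the size of the one-character string cancels the fixed
    object header, leaving only the cost of the n extra characters.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


# Internal cost for an ASCII, a BMP, and an astral character.
per_char_cost = [per_char_bytes(ch) for ch in ("A", "\u0100", "\U00010000")]

print("UTF-32 bytes per code point:", utf32_widths)  # 4 for every sample
print("UTF-8 bytes per code point: ", utf8_widths)   # 1, 2, 3, 4
print("bits needed: %.3f (whole bits: %d)" % (bits_needed, whole_bits))
print("internal bytes per character:", per_char_cost)
```

The UTF-32 widths come out the same for every sample while the UTF-8
widths vary, which is the fixed-width versus variable-width distinction
being discussed. The last line is the only place where the Flexible
String Representation shows through: the per-character cost should grow
with the widest character in the string.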