On Tue, 03 Jun 2014 20:37:27 -0700, Rustom Mody wrote:
> On Wednesday, June 4, 2014 3:11:12 AM UTC+5:30, Paul Sokolovsky wrote:
>> With that in mind, I, as many others, think that forcing Unicode bloat
>> upon people by default is the most controversial feature of Python3.
>> The reason is that you can go a very long way dealing with languages of the
>> people of the world by just treating strings as consisting of 8-bit
>> data. I'd say, that's enough for 90% of applications. Unicode is needed
>> only if one needs to deal with multiple languages *at the same time*,
>> which is fairly rare (remaining 10% of apps).
>> And please keep in mind that MicroPython was originally intended (and
>> should remain scalable down to) an MCU. Unicode is needed there even
>> less, and there are even fewer resources to support Unicode just because.
> At some time (when jmf was making more intelligible noises) I had
> suggested that the choice between 1/2/4 byte strings that happens at
> runtime in python3's FSR can be made at python-start time with a
> command-line switch. There are many combinations here; here is one in
> more detail:
> Instead of having one (FSR) string engine, you have (up to) 4
> - a pure 1 byte (ASCII)
There are only 128 ASCII characters, so a pure ASCII implementation
cannot even represent arbitrary bytes.
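
To make that concrete, here is a small sketch using only the standard
"ascii" codec. The helper name is_ascii is mine, purely for illustration:

```python
# Every value 0..255 is a legal byte, but only 0..127 are ASCII characters.
def is_ascii(data: bytes) -> bool:
    """Return True if the strict ASCII decoder accepts these bytes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

# Count how many of the 256 possible single-byte values decode as ASCII.
ascii_ok = sum(is_ascii(bytes([b])) for b in range(256))
print(ascii_ok)
```

Half of all possible byte values are rejected, so a "pure ASCII" string
type cannot double as a container for arbitrary binary data.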
> - a pure 2 byte (BMP) with decode-failures for out-of-ranges
That's not Unicode. It's a subset of Unicode.
> - a pure 4 byte -- everything UTF-32
For embedded devices, that would be extremely memory hungry. Remember,
every variable, every attribute name, every method and class and function
name is a string. Using at least 56 bytes just to refer to
sys.stdout.write will be painful.
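
A rough sketch of where a figure like 56 comes from, counting only the
character payload of the three names and ignoring any per-object header
(which depends on the implementation and which I am not modelling here):

```python
# Payload-only estimate; real string objects also carry headers.
names = ["sys", "stdout", "write"]      # the identifiers in sys.stdout.write
chars = sum(len(n) for n in names)      # characters, dots not counted

joined = "".join(names)
utf32_bytes = len(joined.encode("utf-32-le"))  # 4 bytes per character, no BOM
one_byte = len(joined.encode("ascii"))         # 1 byte per character

print(chars, utf32_bytes, one_byte)
```

The same names stored one byte per character take a quarter of the space,
which matters a lot on an MCU.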
> - FSR dynamic switching at runtime (with massive moping from the world's
Please stop giving JMF's crackpot opinion even the dignity of being
> 2. My casual/cursory reading of the contents of the SMP-planes suggests
> that the stuff there is things like
> - egyptian hieroglyphics
> - mahjong characters
> - ancient greek musical symbols
> - alchemical symbols etc etc.
> IOW from pov of a universally acceptable character set this is mostly
Certainly some of these things are more whimsical than practical, but it
doesn't really matter. Even if you strip out every bit of whimsy from the
Unicode character set, you're still left with needing more than 65536
characters (16 bits). For efficiency you aren't going to use 17 bits, or
18, or 19, so it's actually faster and more efficient to jump right to 32
bits. For technical reasons which I don't fully understand, Unicode only
uses 21 of those 32 bits, and stops at U+10FFFF rather than using the full
21-bit range, giving a total of 1114112 available code points (17 planes
of 65536 each). Whether you or I personally have need for alchemical symbols,
*some people* do, and supporting their use-case doesn't harm us by one
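
The arithmetic behind the 21 bits and the 1114112 code points, as a small
sketch. The constant names are mine, and the sys.maxunicode comparison
assumes CPython 3.3 or later:

```python
import sys

PLANES = 17                      # plane 0 (the BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 1 << 16  # 65536 code points in each plane

total = PLANES * CODE_POINTS_PER_PLANE   # size of the Unicode code space
highest = total - 1                      # the last code point, U+10FFFF
bits_needed = highest.bit_length()       # bits needed to hold that value

print(total, hex(highest), bits_needed)
print(highest == sys.maxunicode)         # assumption: CPython 3.3+
```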
> And so a pure BMP-supporting implementation may be a reasonable
> compromise. [As long as no surrogate-pairs are there]
At the cost of one extra bit, strings could use UTF-16 internally and
still have correct behaviour. The bit could be a flag recording whether
the string contains any surrogate pairs. If the flag was 0, all string
operations could assume a constant 2-bytes-per-character. If the flag was
1, it could fall back to walking the string checking for surrogate pairs.
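
Here is a minimal sketch of that idea in pure Python. The class name
Utf16Str and its attributes are hypothetical, made up for illustration;
this is not how CPython or MicroPython store strings, and a real
implementation would do this in C over a packed buffer:

```python
class Utf16Str:
    """Hypothetical string type: UTF-16 code units plus a one-bit flag."""

    def __init__(self, text: str):
        data = text.encode("utf-16-le")
        # One 16-bit code unit per list entry.
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        # The "one extra bit": does the string contain any surrogates?
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def __len__(self) -> int:
        if not self.has_surrogates:
            return len(self.units)       # fast path: one unit per character
        # Slow path: a low surrogate is the second half of a pair, skip it.
        return sum(1 for u in self.units if not 0xDC00 <= u <= 0xDFFF)

    def __getitem__(self, index: int) -> str:
        # Negative indexes and slices are left out to keep the sketch short.
        if not 0 <= index < len(self):
            raise IndexError("string index out of range")
        if not self.has_surrogates:
            return chr(self.units[index])    # fast path: O(1) lookup
        # Slow path: walk the units, stepping over pairs two at a time.
        pos = 0
        for _ in range(index):
            pos += 2 if 0xD800 <= self.units[pos] <= 0xDBFF else 1
        unit = self.units[pos]
        if 0xD800 <= unit <= 0xDBFF:
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)


bmp_only = Utf16Str("abc")
mixed = Utf16Str("a\U0001F600b")
print(bmp_only.has_surrogates, len(bmp_only), bmp_only[1])
print(mixed.has_surrogates, len(mixed), len(mixed.units), mixed[2])
```

Strings that stay inside the BMP never pay for the walk; only strings
with the flag set take the linear-time path.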