Please bear with me for a few paragraphs ;-) One aspect of str-type strings is the efficiency afforded when all the encoding really is ascii. If the internal encoding were instead fixed as, e.g., utf-16le, it would probably still be efficient enough on today's computers for most actual string purposes (excluding the current use of str-strings as raw byte sequences).
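To put a rough number on that (assuming pure-ascii text, and ignoring per-object overhead), a fixed utf-16le representation doubles the payload:

    >>> len('abcdef')                       # 8-bit str, ascii payload
    6
    >>> len(u'abcdef'.encode('utf-16-le'))  # the same text as utf-16le bytes
    12

Whether that factor still matters on today's machines is exactly the question.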
I.e., you'd still have to identify what was "strings" (of characters) and what was really byte sequences with no implied or explicit encoding or character semantics. Ok, let's make that distinction explicit: call one kind of string a byte sequence and the other a character sequence (representation being a separate issue). A unicode object is of course the prime _general_ representation of a character sequence in Python, but all the names in Python source code (the ones that become NAME tokens) are UIAM also character sequences, representable by byte sequences interpreted according to the ascii encoding.

For the sake of discussion, suppose we had another _character_ sequence type that was the moral equivalent of unicode except for internal representation, namely a str subclass with an encoding attribute specifying the encoding that you _could_ use to decode the str bytes part to get unicode (which you wouldn't do except when necessary). We could call it

    class charstr(str): ...

and have charstr().bytes be the str part and charstr().encoding specify the encoding part.

In all the contexts where we have obvious encoding information, we could then generate a charstr instead of a str. E.g., if the source of module_a has

    # -*- coding: latin1 -*-
    cs = 'über-cool'

then

    type(cs)        # => <type 'charstr'>
    cs.bytes        # => '\xfcber-cool'
    cs.encoding     # => 'latin-1'

and print cs would act like

    print cs.bytes.decode(cs.encoding)

-- or I guess

    sys.stdout.write(cs.bytes.decode(cs.encoding).encode(sys.stdout.encoding))

followed by

    sys.stdout.write('\n'.decode('ascii').encode(sys.stdout.encoding))

for the newline of the print.

Now if module_b has

    # -*- coding: utf8 -*-
    cs = 'über-cool'

and we interactively import module_a and module_b and then do

    print module_a.cs + ' =?= ' + module_b.cs

what could happen ideally vs. what we have currently? UIAM, currently we would just get the three str byte sequences concatenated to make '\xfcber-cool =?= \xc3\xbcber-cool', and that would be printed, without conversion, as whatever it happens to look like when the output is interpreted according to sys.stdout.encoding.

But if those cs instances had been charstr instances, the coding-cookie encoding information would have been preserved, and the interactive print could have evaluated the string expression -- given cs.decode() as sugar for

    cs.bytes.decode(cs.encoding
                    or globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

-- as

    module_a.cs.decode() + ' =?= '.decode() + module_b.cs.decode()

if pairwise terms differ in encoding, as they all might here.

If the interactive session source were e.g. latin-1, like module_a, then module_a.cs + ' =?= ' would not require an encoding change, because the ' =?= ' would be a charstr instance with encoding == 'latin-1', and so the result would still be latin-1 that far. But with module_b.cs being utf8, the next addition would cause the .decode() promotions to unicode. In a console window, the ' =?= '.encoding might be 'cp437' or such, and the first addition would then already cause promotion (since module_a.cs.encoding != 'cp437').

I have sneaked in run-time access to individual modules' encodings by assuming that the encoding cookie could be compiled in as an explicit global __encoding__ variable for any given module (what to use as __encoding__ for built-in modules could vary for various purposes).
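To make that concrete, here is a very rough Python sketch of the kind of charstr I have in mind. It is purely illustrative: charstr, .bytes, .encoding and the __encoding__ global are the hypothetical names from above, a real implementation would presumably live in C, and it would have to cover far more operations than __add__:

    import sys

    class charstr(str):
        """Hypothetical str subclass that remembers the encoding of its bytes."""

        def __new__(cls, bytes_, encoding=None):
            self = str.__new__(cls, bytes_)
            # Fall back to a module-level __encoding__ (the compiled-in coding
            # cookie), then to the system default.  A real implementation would
            # look up __encoding__ in the *caller's* module, not this one.
            self.encoding = (encoding
                             or globals().get('__encoding__')
                             or sys.getdefaultencoding())
            return self

        @property
        def bytes(self):
            # the raw, undecoded str payload
            return str(self)

        def decode(self, encoding=None, errors='strict'):
            # sugar: decode using the carried encoding unless told otherwise
            return str.decode(self, encoding or self.encoding, errors)

        def __add__(self, other):
            if isinstance(other, charstr) and other.encoding == self.encoding:
                # same encoding: stay in the compact byte representation
                return charstr(str(self) + str(other), self.encoding)
            if isinstance(other, (charstr, unicode)):
                # encodings differ: promote both sides to unicode
                if not isinstance(other, unicode):
                    other = other.decode()
                return self.decode() + other
            if isinstance(other, str):
                # plain str with no encoding info: assume ours (debatable)
                return charstr(str(self) + other, self.encoding)
            return NotImplemented

With something like that, module_a.cs + ' =?= ' would stay latin-1 bytes (whether the literal is a plain str or a latin-1 charstr), while adding module_b's utf8 charstr would promote the whole result to unicode -- which is the behavior sketched above.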
ISTM this could be useful in situations where an encoding assumption is necessary and 'ascii' is currently not as good a guess as one could make, though I suspect that if string literals became charstr strings instead of str strings, many if not most of those situations would disappear (I'm saying this because ATM I can't think of an 'ascii'-guess situation that wouldn't go away ;-)

If there were a charchr() version of chr() that produced a charstr instead of a str, IWT one would want an easy-sugar default encoding assumption, probably the same one you'd assume for '%c' % num in a given module's source -- which presumably would be '%c'.encoding, where '%c' takes the encoding of the module source, normally recorded in __encoding__. So charchr(n) would act like

    chr(n).decode().encode(''.encoding)

-- or more reasonably charstr(chr(n)), which would be short for

    charstr(chr(n), globals().get('__encoding__')
                    or __import__('sys').getdefaultencoding())

Or some efficient equivalent ;-)

Using strings in dicts requires hashing to find key-comparison candidates and comparison to check for key equivalence. This would seem to point to some kind of normalized hashing, but not necessarily normalized key representation. Some of that is apparently happening already, since

    >>> hash('a') == hash(unicode('a'))
    True

I don't know what would be worth the trouble to optimize string key usage where the strings really are all of one encoding, vs. totally general use, vs. a heavily biased mix -- or even whether it could be done without unreasonable complexity. Maybe a dict could be given an option to hash all its keys as unicode vs. whatever it does now. But having a charstr subtype of str would improve the "implicit" conversions to unicode IMO.

Anyway, I wanted to throw in my .02USD re the implicit conversions, taking the view that much of the implicitness could be based on reliable inferences from the source encodings of string literals or from their effects as format strings.

Regards,
Bengt Richter

[not a normal subscriber to python-dev, so I'll have to google for any responses]
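P.S. To make the charchr() idea concrete too, here is a rough sketch in the same spirit as the charstr sketch above (all names hypothetical; as before, a real version would look up __encoding__ in the calling module's globals rather than its own):

    import sys

    def charchr(n, encoding=None):
        # hypothetical chr() variant that returns a charstr tagged with the
        # source encoding instead of a bare str
        return charstr(chr(n),
                       encoding
                       or globals().get('__encoding__')
                       or sys.getdefaultencoding())

    # e.g. charchr(0xfc, 'latin-1') would be charstr('\xfc', 'latin-1'),
    # and charchr(0xfc, 'latin-1').decode() would give u'\xfc'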