>>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>>> encoded strings for interfacing with the outside non-unicode world. Or
>>> something like that. <shrug>
>>
>> Hm... Requiring each str8 instance to have an encoding might be a
>> problem -- it means you can't just create one from a bytes object.
>> What would be the use of this information? What would happen on
>> concatenation? On slicing? (Slicing can break the encoding!)
>
> Round trips to and from bytes should work just fine. Why would that be
> a problem?

I'm strongly opposed to adding encoding information to str8 objects.
I think they will eventually go away anyway; adding that kind of
overhead now is a waste of both developers' time and memory
resources; plus it has all the semantic issues that Guido points
out.

As for creating str8 objects from bytes objects: if you want the
str8 object to carry an encoding, you would have to *specify* the
encoding when creating the str8 object, since the bytes object does
not have that information. This is *very* hard, as you may not know
what the encoding is when you need to create the str8 object.

> There really is no safety in concatenation and slicing of encoded 8bit
> strings now. If by accident two strings of different encodings are
> combined, then all bets are off. And since there is no way to ask a
> string what its current encoding is, it becomes an easy to make and
> hard to find silent error. So we have to be very careful not to mix
> encoded strings with different encodings.

Please answer the question: what would happen on concatenation? In
particular, what is the value of the encoding for the result of the
concatenated string if one input is "latin-1", and the other one is
"utf-8"?

It's easy to tell what happens now: the bytes of those input strings
are just appended; the result string does not follow a consistent
character encoding anymore. This answer does not apply to your
proposed modification, as it does not answer what the value of the
.encoding attribute of the str8 would be after concatenation
(likewise for slicing).

> It's not too different from trying to find the current unicode and str8
> issues in the py3k-struni branch.

This sentence I do not understand. What is not too different from
trying to find issues?

> Concatenating str8 and str types is a bit safer, as long as the str8 is
> in "the" default encoding, but it may still be an unintended implicit
> conversion. And if it's not in the default encoding, then all bets are
> off again.

Sure.
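
To make the "bytes are just appended" behaviour concrete, here is a
minimal sketch. It uses the Python 3 bytes type as a stand-in for str8
(an assumption; str8 itself only exists in the py3k-struni branch), and
the helper decodes_as is a hypothetical name made up for this example.
It also shows how slicing can cut through a multi-byte sequence.

```python
# Sketch only: bytes stands in for str8, which is an assumption.
# The same text, encoded two different ways.
latin1_part = "caf\u00e9".encode("latin-1")   # 4 bytes, e-acute is 0xE9
utf8_part = "caf\u00e9".encode("utf-8")       # 5 bytes, e-acute is 0xC3 0xA9

# Concatenation appends the raw bytes; no encoding is tracked anywhere.
mixed = latin1_part + utf8_part


def decodes_as(data, encoding):
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The mixed result is not valid UTF-8 ...
mixed_is_utf8 = decodes_as(mixed, "utf-8")

# ... and latin-1 accepts any byte sequence, so decoding "succeeds"
# but silently turns the UTF-8 half into the wrong characters.
mixed_as_latin1 = mixed.decode("latin-1")

# Slicing can break an encoding as well: cutting inside the two-byte
# UTF-8 sequence leaves a dangling lead byte.
sliced = utf8_part[:4]
sliced_is_utf8 = decodes_as(sliced, "utf-8")
```

Nothing in the sketch says what an .encoding attribute on the result
should be, which is exactly the open question.
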
However, the str8 type will go away, and along with it all these
issues.

> The use would be in ensuring the integrity of encoded strings.
> Concatenating strings with different encodings could then produce
> errors.

Ok. What about slicing?

Regards,
Martin
_______________________________________________
Python-3000 mailing list
http://mail.python.org/mailman/listinfo/python-3000
