Guido van Rossum wrote:
> On 6/13/07, Ron Adam <[EMAIL PROTECTED]> wrote:
>> Well I can see where a str8() type with an __encoded_with__ attribute
>> could be useful. It would use a bit more memory, but it won't be the
>> default/primary string type anymore so maybe it's ok.
>>
>> Then bytes can be bytes, and unicode can be unicode, and str8 can be
>> encoded strings for interfacing with the outside non-unicode world.
>> Or something like that. <shrug>
>
> Hm... Requiring each str8 instance to have an encoding might be a
> problem -- it means you can't just create one from a bytes object.
> What would be the use of this information? What would happen on
> concatenation? On slicing? (Slicing can break the encoding!)
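[Editor's note: a minimal, hypothetical illustration of the slicing
concern raised above, assuming modern bytes/str semantics; it is not
part of the original message.]

data = "héllo".encode("utf-8")    # b'h\xc3\xa9llo' -- 'é' is two bytes
piece = data[:2]                  # the slice splits the two-byte 'é'
try:
    piece.decode("utf-8")
except UnicodeDecodeError as exc:
    print("slice broke the encoding:", exc)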
Round trips to and from bytes should work just fine. Why would that be
a problem?

There really is no safety in concatenation and slicing of encoded 8-bit
strings now. If by accident two strings with different encodings are
combined, then all bets are off. And since there is no way to ask a
string what its current encoding is, it becomes an easy-to-make and
hard-to-find silent error. So we have to be very careful not to mix
encoded strings with different encodings.

It's not too different from trying to find the current unicode and str8
issues in the py3k-struni branch. Concatenating str8 and str types is a
bit safer, as long as the str8 is in "the" default encoding, but it may
still be an unintended implicit conversion. And if it's not in the
default encoding, then all bets are off again.

The use would be in ensuring the integrity of encoded strings.
Concatenating strings with different encodings could then produce
errors, and explicit casting could automatically decode and encode as
needed, which would eliminate a lot of encode/decode confusion.

This morning I was thinking all of this could be done as a module that
possibly uses metaclasses or mixins to create encoded string types.
Then it wouldn't need an attribute on the instances. Possibly someone
has already done something along those lines?

But back to the issues at hand...

>> Attached both the str8 repr as s"..." and s'...', and the latest
>> no_raw_escape patch which I think is complete now and should apply
>> with no problems.
>
> I like the str8 repr patch enough to check it in.
>
>> I tracked the random failures I am having in test_tokenize.py down to
>> it doing a round trip on random test_*.py files. If one of those
>> files has a problem, it causes test_tokenize.py to fail also. So I
>> added a line to the test to output the file name it does the round
>> trip on, so those can be fixed as they are found.
>>
>> Let me know if it needs to be adjusted or something doesn't look
>> right.
>
> Well, I'm still philosophically uneasy with r'\' being a valid string
> literal, for various reasons (one being that writing a string parser
> becomes harder and harder).

Hmmm... It looks to me that the thing that makes it somewhat hard is
determining whether it's a single-quote, empty-single-quote, or
triple-quote string. I made some improvements to that in tokenize.c,
although it may not be clear from just looking at the unified diff.

After that, it was just a matter of checking a !is_raw_str flag before
blindly accepting the following character. Before that, it was a matter
of doing that and checking the quote-type status as well, which wasn't
intuitive since the string-parsing loop was entered before the opening
quote type was confirmed.

I can remove the raw-string flag and flag check and leave the other
changes in, or revert the whole file back. Any preference? The latter
makes it an easy, roughly three-line change to add r'\' support back
in.

I'll have to look at tokenize.py again to see what needs to be done
there. It uses regular expressions to parse the file.

> I definitely want r'\u1234' to be a 6-character string, however. Do
> you have a patch that does just that? (We can argue over the rest
> later in a larger forum.)

I can split the patch into two patches, and the second patch, which
allows an escape at the end of a string, can be reviewed later.

What about br'\'? Should that be excluded also?
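[Editor's note: a rough sketch of the encoding-aware 8-bit string type
described above. The class name, constructor, and behaviour are
hypothetical illustrations, not an actual py3k or str8 API.]

class EncodedStr(bytes):
    """Bytes that remember which encoding they were produced with."""
    def __new__(cls, data, encoding):
        if isinstance(data, str):
            data = data.encode(encoding)
        self = super().__new__(cls, data)
        self.encoding = encoding
        return self

    def __add__(self, other):
        # Refuse to silently mix encodings instead of corrupting data.
        if isinstance(other, EncodedStr) and other.encoding != self.encoding:
            raise ValueError("cannot concatenate %s and %s data"
                             % (self.encoding, other.encoding))
        return EncodedStr(bytes(self) + bytes(other), self.encoding)

    def to_text(self):
        # "Explicit casting" back to text uses the stored encoding.
        return self.decode(self.encoding)

a = EncodedStr("héllo", "utf-8")
b = EncodedStr("wörld", "latin-1")
print(a.to_text())   # héllo
try:
    a + b            # mixed encodings raise instead of silently combining
except ValueError as exc:
    print(exc)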
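[Editor's note: for readers following the raw-string discussion, a tiny
demonstration of the semantics being debated, using current Python
behaviour.]

print(len(r'\u1234'))   # 6 -- raw literal keeps the backslash: \ u 1 2 3 4
print(len('\u1234'))    # 1 -- the single character U+1234
# r'\' is a syntax error: a raw literal cannot end with an odd number of
# backslashes, which is the behaviour the no_raw_escape patch discussed
# above would change.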
Ron
