Vitja Makarov, 15.01.2011 19:29:
> 2011/1/15 Stefan Behnel:
>> Stefan Behnel, 15.01.2011 19:13:
>>> Vitja Makarov, 15.01.2011 18:46:
>>>> What will it print for '\u1234'?
>>>>
>>>> Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
>>>> [GCC 4.4.5] on linux2
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> >>> '\u'
>>>> '\\u'
>>>> >>> '\u1234'
>>>> '\\u1234'
>>>> >>>
>>>>
>>>> I think that '\u' should be translated into '\\u' for python2
>>>
>>> That's what it does, yes. This works because we actually parse unprefixed
>>> strings in parallel as byte strings and unicode strings.
>>>
>>> However, now that I tried it, I actually get the same result in Py3,
>>> although it should have parsed the string correctly. Not sure if we
>>> discussed this problem before, but it looks like a bug to me.
>>
>> Thinking about this some more, it's inconsistent either way.
>>
>> 1) If the literal string semantics should be fixed at compile time, you
>> shouldn't get a unicode string in Python 3 in the first place.
>>
>> 2) If the literal should become a byte string in Py2 and a unicode string
>> in Py3, then the unicode string should be what you'd get if you ran
>> your code in Py3, i.e. the unescaped unicode literal.
>>
>> Given that 1) is out of discussion, 2) should be fixed, IMHO.
>
> Can't we rely on -[23] cython switch?
>
> In -2 mode strings are always byte string and -3 always unicode?
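
For reference, here is a minimal sketch of the two readings of that literal. It is plain CPython 3, not Cython output. The function names are made up for illustration, and the byte string spells out by hand what Python 2 would store for the unprefixed literal.

```python
# Runs on Python 3 only. Function names are hypothetical, for illustration.

def py2_style_literal() -> bytes:
    # What Python 2 stores for an unprefixed '\u1234': \u is not an escape
    # in a byte string, so all six characters are kept as they are.
    return b'\\u1234'


def py3_style_literal() -> str:
    # What Python 3 stores for the same source text: one code point, U+1234.
    return '\u1234'


def unescape(raw: bytes) -> str:
    # Interpret the backslash escapes the way a unicode literal would.
    return raw.decode('unicode_escape')


if __name__ == '__main__':
    print(len(py2_style_literal()), len(py3_style_literal()))   # 6 1
    print(unescape(py2_style_literal()) == py3_style_literal()) # True
```

The same source characters thus give a six-byte string under the Py2 reading and a one-character string under the Py3 reading.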

With -3, unprefixed strings *are* unicode strings. As for -2, there isn't
currently any change in behaviour when you use that switch, and I feel
reluctant to change that. For one, I doubt that anyone would seriously use it.

The fact that unicode escapes in unprefixed strings behave differently in
Python 2 and Python 3 is unlikely to cause trouble in real-world code, i.e.
outside of CPython's regression test suite.

> self.assertEqual(audioop.lin2alaw(data[0], 1), '\xd5\xc5\xf5')

That's a different problem. You will notice that this code has been fixed to
use the 'b' prefix in the Py3 test suite.

This is a problem that cannot be solved automatically. For Python 2 code, the
compiler cannot know if the user intended an unprefixed literal to be a
(binary) byte string or a unicode (text) string. Only a human brain can
disambiguate the code here. Remember that Python 2 will also try to decode the
above binary bytes literal if it happens to be concatenated with a unicode
string for some reason. String handling is structurally hard to get right in
Python 2; we have to live with that (and hope that Py2 will die out soon).

I think it's a great feature of Cython that it fails fast and thus tells you
that your code is ambiguous and requires changes to work in Python 3. It
perfectly found the problems in the above code, for one.

Stefan
_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev