Re: [Cython] string literal parsing problem

Stefan Behnel Sat, 15 Jan 2011 10:55:53 -0800

Vitja Makarov, 15.01.2011 19:29:
> 2011/1/15 Stefan Behnel:
>> Stefan Behnel, 15.01.2011 19:13:
>>> Vitja Makarov, 15.01.2011 18:46:
>>>> What will it print for '\u1234'?
>>>>
>>>> Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
>>>> [GCC 4.4.5] on linux2
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> >>> '\u'
>>>> '\\u'
>>>> >>> '\u1234'
>>>> '\\u1234'
>>>> >>>
>>>>
>>>> I think that '\u' should be translated into '\\u' for python2
>>>
>>> That's what it does, yes. This works because we actually parse unprefixed
>>> strings in parallel as byte strings and unicode strings.
>>>
>>> However, now that I tried it, I actually get the same result in Py3,
>>> although it should have parsed the string correctly. Not sure if we
>>> discussed this problem before, but it looks like a bug to me.
>>
>> Thinking about this some more, it's inconsistent either way.
>>
>> 1) If the literal string semantics should be fixed at compile time, you
>> shouldn't get a unicode string in Python 3 in the first place.
>>
>> 2) If the literal should become a byte string in Py2 and a unicode string
>> in Py3, then the unicode string should be what you'd get if you ran
>> your code in Py3, i.e. the unescaped unicode literal.
>>
>> Given that 1) is out of discussion, 2) should be fixed, IMHO.
>
> Can't we rely on -[23] cython switch?
>
> In -2 mode strings are always byte string and -3 always unicode?


With -3, unprefixed strings *are* unicode strings.

As for -2, there isn't currently any change in behaviour when you use that 
switch, and I feel reluctant to change that. For one, I doubt that anyone 
would seriously use it. The fact that unicode escapes in unprefixed 
strings behave differently in Python 2 and Python 3 is unlikely to cause 
problems in real-world code, i.e. outside of CPython's regression test suite.
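
To illustrate the difference, here is a minimal sketch in plain Python 3 (not Cython output). The raw bytes literal stands in for what Python 2 produces from an unprefixed '\u1234'; the variable names are only for illustration.

```python
# Python 2 reading of an unprefixed '\u1234': the escape is not
# interpreted, so the literal is six separate bytes: \ u 1 2 3 4
byte_literal = br'\u1234'

# Python 3 reading of the same source text: a single code point U+1234
text_literal = '\u1234'

print(len(byte_literal), len(text_literal))
```

The same source characters thus give a six-byte string under one reading and a one-character string under the other.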


>          self.assertEqual(audioop.lin2alaw(data[0], 1), '\xd5\xc5\xf5')

That's a different problem. You will notice that this code has been fixed 
to use the 'b' prefix in the Py3 test suite.

This is a problem that cannot be solved automatically. For Python 2 code, 
the compiler cannot know if the user intended an unprefixed literal to be a 
(binary) byte string or a unicode (text) string. Only a human brain can 
disambiguate the code here. Remember that Python 2 will also try to decode 
the above binary bytes literal if it happens to be concatenated with a 
unicode string for some reason. String handling is structurally hard to get 
right in Python 2. We have to live with that (and hope that Py2 will die 
out soon).
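
A rough sketch of that failure mode, written for Python 3, where the implicit decoding step no longer exists. The explicit decode('ascii') call below is an assumption that stands in for what Python 2 attempts on its own when a byte string meets a unicode string.

```python
data = b'\xd5\xc5\xf5'  # the binary result from the test above

mixed_error = None
decode_error = None

# Python 3 refuses to mix bytes and text at all.
try:
    data + u'text'
except TypeError as exc:
    mixed_error = type(exc).__name__

# Python 2 would instead try an implicit ASCII decode of the bytes,
# which fails for these values because they are all above 0x7f.
try:
    data.decode('ascii')
except UnicodeDecodeError as exc:
    decode_error = type(exc).__name__

print(mixed_error, decode_error)
```

Either way, the code only works once a human has decided that the literal is binary data and marked it with the 'b' prefix.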

I think it's a great feature of Cython that it fails fast and thus tells 
you that your code is ambiguous and requires changes to work in Python 3. 
It perfectly found the problems in the above code, for one.

Stefan