Re: [Cython] string literal parsing problem

Robert Bradshaw Sat, 15 Jan 2011 12:35:31 -0800

On Sat, Jan 15, 2011 at 10:55 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> Vitja Makarov, 15.01.2011 19:29:
>> 2011/1/15 Stefan Behnel:
>>> Stefan Behnel, 15.01.2011 19:13:
>>>> Vitja Makarov, 15.01.2011 18:46:
>>>>> What will it print for '\u1234'?
>>>>>
>>>>> Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56)
>>>>> [GCC 4.4.5] on linux2
>>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> >>> '\u'
>>>>> '\\u'
>>>>> >>> '\u1234'
>>>>> '\\u1234'
>>>>> >>>
>>>>>
>>>>> I think that '\u' should be translated into '\\u' for python2
>>>>
>>>> That's what it does, yes. This works because we actually parse unprefixed
>>>> strings in parallel as byte strings and unicode strings.
>>>>
>>>> However, now that I tried it, I actually get the same result in Py3,
>>>> although it should have parsed the string correctly. Not sure if we
>>>> discussed this problem before, but it looks like a bug to me.
>>>
>>> Thinking about this some more, it's inconsistent either way.
>>>
>>> 1) If the literal string semantics should be fixed at compile time, you
>>> shouldn't get a unicode string in Python 3 in the first place.
>>>
>>> 2) If the literal should become a byte string in Py2 and a unicode string
>>> in Py3, then the unicode string should be what you'd get if you ran
>>> your code in Py3, i.e. the unescaped unicode literal.
>>>
>>> Given that 1) is out of the question, 2) should be fixed, IMHO.
>>
>> Can't we rely on -[23] cython switch?
>>
>> In -2 mode, strings would always be byte strings, and in -3 always unicode?
>
> With -3, unprefixed strings *are* unicode strings.
>
> As for -2, there isn't currently any change in behaviour when you use that
> switch, and I feel reluctant to change that. For one, I doubt that anyone
> would seriously use it. The problem that unicode escapes in unprefixed
> strings behave differently in Python 2 and Python 3 is unlikely to create
> problems in real world code, i.e. outside of CPython's regression test suite.
>
>
>>          self.assertEqual(audioop.lin2alaw(data[0], 1), '\xd5\xc5\xf5')
>
> That's a different problem. You will notice that this code has been fixed
> to use the 'b' prefix in the Py3 test suite.
>
> This is a problem that cannot be solved automatically. For Python 2 code,
> the compiler cannot know if the user intended an unprefixed literal to be a
> (binary) byte string or a unicode (text) string. Only a human brain can
> disambiguate the code here. Remember that Python 2 will also try to decode
> the above binary bytes literal if it happens to be concatenated with a
> unicode string for some reason. String handling is structurally hard to get
> right in Python 2, we have to live with that (and hope that Py2 will die
> out soon).


I wouldn't count on it.

> I think it's a great feature of Cython that it fails fast and thus tells
> you that your code is ambiguous and requires changes to work in Python 3.
> It perfectly found the problems in the above code, for one.

Whether it's the -2 flag, or something else, we should at least have a
mode that handles things exactly as they would be handled in Python 2.
Otherwise people won't be able to just compile their existing code
without worrying about subtle issues like this. Of course, a compile
time error is much more acceptable than different runtime behavior.
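
To make the difference concrete, here is a minimal sketch (plain Python 3,
not Cython's actual parser) of the two readings of the source text \u1234
discussed above. The helper names are made up for illustration, and the
Python 2 reading is only modelled for escapes that a byte string literal
leaves untouched, such as \u.

```python
# The six characters as they appear between the quotes in the source file.
SOURCE = r'\u1234'


def as_py2_literal(source_chars):
    """Hypothetical helper: the Python 2 reading of an unprefixed literal.

    A Python 2 byte string does not know the \\u escape, so the characters
    are kept as they are. This shortcut is NOT valid for escapes such as
    \\x or \\n, which byte strings do process.
    """
    return source_chars.encode('ascii')


def as_py3_literal(source_chars):
    """Hypothetical helper: the Python 3 reading of an unprefixed literal.

    The 'unicode_escape' codec resolves \\uXXXX to a single code point.
    """
    return source_chars.encode('ascii').decode('unicode_escape')


print(len(as_py2_literal(SOURCE)))  # 6
print(len(as_py3_literal(SOURCE)))  # 1
```

The same source text yields a six-byte string under one reading and a
one-character string under the other, which is the kind of silent runtime
difference that a strict Python 2 mode (or a compile time error) would avoid.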

- Robert
