On Mon, Sep 6, 2010 at 9:36 AM, Dag Sverre Seljebotn
<[email protected]> wrote:
> Robert Bradshaw wrote:
>> On Sat, Sep 4, 2010 at 10:59 PM, Stefan Behnel <[email protected]> wrote:
>>
>>> Robert Bradshaw, 05.09.2010 07:06:
>>>
>>>> On Sat, Sep 4, 2010 at 9:24 PM, Stefan Behnel wrote:
>>>>
>>>>> Robert Bradshaw, 04.09.2010 22:04:
>>>>>
>>>>>> How about we parse the literals as unicode strings, and if used in a
>>>>>> bytes context we raise a compile time error if any characters are
>>>>>> larger than a char?
>>>>>>
>>>>> Can't work because you cannot recover the original byte sequence from a
>>>>> decoded Unicode string. It may have used escapes or not, and it may or may
>>>>> not be encodable using the source code encoding.
>>>>>
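(To make Stefan's point concrete: once an escape like '\xff' has been
decoded to the code point U+00FF, getting the original byte back means
re-encoding with the source encoding, and with an ASCII source that
simply fails:

    >>> u'\xff'.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 0: ordinal not in range(128)

so a byte escape that was perfectly meaningful as a byte becomes
unrepresentable once it has passed through a unicode string.)
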
>>>> I'm saying we shouldn't care about using escapes, and should raise a
>>>> compile time error if it's not encodable using the source encoding.
>>>>
>>> In that case, you'd break most code that actually uses escapes. If the byte
>>> values were correctly representable using the source encoding the escapes
>>> wouldn't be necessary in the first place.
>>>
>>
>> The most common escape is probably \n, followed by \0, \r, \t... As
>> for \uXXXX, that is just a superset of \xXX that only works for
>> unicode literals.
>>
>>
>>>> In other words, I'm not a fan of
>>>>
>>>> foo("abc \u0001")
>>>>
>>>> behaving (in my opinion) very differently depending on whether foo
>>>> takes a char* or object argument.
>>>>
>>> It's Python compatible, though:
>>>
>>
>> No, it's not. Python doesn't have the concept of "used in a C context."
>>
>>
>>> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
>>> [GCC 4.4.3] on linux2
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> 'abc \u0001'
>>> 'abc \\u0001'
>>> >>> len('abc \u0001')
>>> 10
>>> >>> u'abc \u0001'
>>> u'abc \x01'
>>> >>> len(u'abc \u0001')
>>> 5
>>>
>>> Same for Python 3 with the 'b' prefix on the byte string examples.
>>>
>>
>> When I see b"abc \u0001" or u"abc \u0001" I know exactly what it
>> means. When I see "abc \u0001" I have to know whether unicode literals
>> are enabled to know what it means, but now you've changed it so that's
>> not enough anymore--I have to determine whether it's being used in a
>> char* or object context, which I think is something we want to
>> minimize.
>>
>> I'm with Lisandro and Carl Witty--how about just letting the parser
>> parse them as unicode literals and then only accepting conversion back
>> to char* for plain ASCII rather than introducing more complicated
>> logic and semantics?
>>
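(Concretely, that rule would draw the line roughly like this -- a
hypothetical sketch, with unicode_literals (or -3) in effect so the
plain literals below are parsed as unicode:

    cdef char* a = 'abc\u0001'   # every code point is < 128: coerces to the 4 bytes "abc\x01"
    cdef char* b = 'abc\u20ac'   # U+20AC is not ASCII: rejected at compile time
    cdef object c = 'abc\u20ac'  # fine as a Python unicode object

i.e. whether the coercion is allowed depends only on the decoded
characters, not on which escapes were used to spell them.)
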
> I don't understand this suggestion. What happens in each of these cases,
> for different settings of "from __future__ import unicode_literals"?
>
> cdef char* x1 = 'abc\u0001'
> cdef char* x2 = 'abc\x01'

With from __future__ import unicode_literals (or -3):
    len(x1) == 4
    len(x2) == 4
Otherwise:
    len(x1) == 9
    len(x2) == 4
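
(Where those numbers come from, in a plain CPython 2 session:

    >>> len('abc\u0001')    # \u is not an escape in a byte string literal
    9
    >>> len(u'abc\u0001')   # as a unicode literal it is a single code point
    4
    >>> len('abc\x01')      # \x01 is a single byte either way
    4

and in the unicode_literals case the four characters of u'abc\u0001'
encode to the four bytes "abc\x01" for the char*.)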
- Robert