Robert Bradshaw wrote:
> On Sat, Sep 4, 2010 at 10:59 PM, Stefan Behnel <[email protected]> wrote:
>> Robert Bradshaw, 05.09.2010 07:06:
>>> On Sat, Sep 4, 2010 at 9:24 PM, Stefan Behnel wrote:
>>>> Robert Bradshaw, 04.09.2010 22:04:
>>>>> How about we parse the literals as unicode strings, and if used in a
>>>>> bytes context we raise a compile time error if any characters are
>>>>> larger than a char?
>>>>
>>>> Can't work because you cannot recover the original byte sequence from a
>>>> decoded Unicode string. It may have used escapes or not, and it may or may
>>>> not be encodable using the source code encoding.
>>>
>>> I'm saying we shouldn't care about using escapes, and should raise a
>>> compile time error if it's not encodable using the source encoding.
>>
>> In that case, you'd break most code that actually uses escapes. If the byte
>> values were correctly representable using the source encoding the escapes
>> wouldn't be necessary in the first place.
>
> The most common escape is probably \n, followed by \0, \r, \t... As
> for \uXXXX, that is just a superset of \xXX that only works for
> unicode literals.
>
>>> In other words, I'm not a fan of
>>>
>>> foo("abc \u0001")
>>>
>>> behaving (in my opinion) very differently depending on whether foo
>>> takes a char* or object argument.
>>
>> It's Python compatible, though:
>
> No, it's not. Python doesn't have the concept of "used in a C context."
>
>> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
>> [GCC 4.4.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'abc \u0001'
>> 'abc \\u0001'
>> >>> len('abc \u0001')
>> 10
>> >>> u'abc \u0001'
>> u'abc \x01'
>> >>> len(u'abc \u0001')
>> 5
>>
>> Same for Python 3 with the 'b' prefix on the byte string examples.
>
> When I see b"abc \u0001" or u"abc \u0001" I know exactly what it
> means. When I see "abc \u0001" I have to know whether unicode literals
> are enabled to know what it means, but now you've changed it so that's
> not enough anymore--I have to determine whether it's being used in a
> char* or object context, which I think is something we want to
> minimize.
>
> I'm with Lisandro and Carl Witty--how about just letting the parser
> parse them as unicode literals and then only accepting conversion back
> to char* for plain ASCII rather than introducing more complicated
> logic and semantics?

I don't understand this suggestion. What happens in each of these cases,
for different settings of "from __future__ import unicode_literals"?
cdef char* x1 = 'abc\u0001'
cdef char* x2 = 'abc\x01'
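
For reference, and unless I am misreading it, this is what plain CPython 2.x
itself does with these two literals (an interpreter check in the same style
as Stefan's session above; it only shows the Python-level values, and how
Cython should then coerce each one to char* is exactly what is being debated):

>>> 'abc\u0001'        # plain str literal: \u is not an escape here
'abc\\u0001'
>>> len('abc\u0001')
9
>>> 'abc\x01'          # \x is an escape in both str and unicode literals
'abc\x01'
>>> len('abc\x01')
4
>>> from __future__ import unicode_literals
>>> 'abc\u0001'        # now a unicode literal, so \u0001 is decoded
u'abc\x01'
>>> len('abc\u0001')
4
>>> 'abc\x01'
u'abc\x01'
>>> len('abc\x01')
4

So with unicode_literals in effect the two literals are the same 4-character
unicode string, while without it the first one contains a literal
backslash-u sequence; the question is what each of them should mean when
assigned to a char*.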

Dag Sverre