On Sat, Sep 4, 2010 at 10:59 PM, Stefan Behnel <[email protected]> wrote:
> Robert Bradshaw, 05.09.2010 07:06:
>> On Sat, Sep 4, 2010 at 9:24 PM, Stefan Behnel wrote:
>>> Robert Bradshaw, 04.09.2010 22:04:
>>>> How about we parse the literals as unicode strings, and if used in a
>>>> bytes context we raise a compile time error if any characters are
>>>> larger than a char?
>>>
>>> Can't work because you cannot recover the original byte sequence from a
>>> decoded Unicode string. It may have used escapes or not, and it may or may
>>> not be encodable using the source code encoding.
>>
>> I'm saying we shouldn't care about using escapes, and should raise a
>> compile time error if it's not encodable using the source encoding.
>
> In that case, you'd break most code that actually uses escapes. If the byte
> values were correctly representable using the source encoding the escapes
> wouldn't be necessary in the first place.
The most common escape is probably \n, followed by \0, \r, \t, etc. As
for \uXXXX, it is essentially a superset of \xXX that is only
interpreted in unicode literals.
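To make that concrete, here is a quick interactive illustration (plain
Python 2, same interpreter style as the session you quote below):

    >>> u'\x41' == u'\u0041' == u'A'   # both escapes name U+0041 in a unicode literal
    True
    >>> len('\u0041'), len('\x41')     # in a byte string, \uXXXX is not an escape at all
    (6, 1)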
>> In other words, I'm not a fan of
>>
>> foo("abc \u0001")
>>
>> behaving (in my opinion) very differently depending on whether foo
>> takes a char* or object argument.
>
> It's Python compatible, though:
No, it's not. Python doesn't have the concept of "used in a C context."
> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
> [GCC 4.4.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> 'abc \u0001'
> 'abc \\u0001'
> >>> len('abc \u0001')
> 10
> >>> u'abc \u0001'
> u'abc \x01'
> >>> len(u'abc \u0001')
> 5
>
> Same for Python 3 with the 'b' prefix on the byte string examples.
When I see b"abc \u0001" or u"abc \u0001" I know exactly what it
means. When I see "abc \u0001" I have to know whether unicode literals
are enabled to know what it means. Under your proposal that's no longer
enough: I also have to work out whether the literal is being used in a
char* or an object context, which is precisely the kind of
context-dependence I think we want to minimize.
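To spell out the ambiguity, here is a rough sketch (the declarations
and names are made up for illustration):

    cdef extern from "foo.h":
        void c_func(char* s)        # hypothetical C function taking char*

    def py_func(s):                 # takes a plain Python object
        return s

    c_func("abc \u0001")    # char* context: the raw 10-byte sequence
    py_func("abc \u0001")   # object context: a 5-character unicode string
                            # (or 10 bytes, depending on the literals setting)

The same source token ends up meaning two different things depending on
the signature it happens to be passed to.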
I'm with Lisandro and Carl Witty--how about just letting the parser
parse them as unicode literals and then only accepting conversion back
to char* for plain ASCII, rather than introducing more complicated
logic and semantics?
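Something like this minimal sketch of the rule (pure illustration, not
actual Cython internals; the name is made up):

    def literal_to_char_ptr(text):
        # The literal has already been parsed as a unicode string; coercion
        # back to char* is only allowed when it is plain ASCII.
        try:
            return text.encode('ascii')
        except UnicodeEncodeError:
            raise ValueError("non-ASCII string literal used in a char* context")

Anything outside ASCII in a char* context would then be a compile-time
error, and anyone who really wants raw bytes can use an explicit b''
prefix.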
- Robert