Robert Bradshaw wrote:
> On Sat, Sep 4, 2010 at 10:59 PM, Stefan Behnel <[email protected]> wrote:
>> Robert Bradshaw, 05.09.2010 07:06:
>>> On Sat, Sep 4, 2010 at 9:24 PM, Stefan Behnel wrote:
>>>> Robert Bradshaw, 04.09.2010 22:04:
>>>>> How about we parse the literals as unicode strings, and if used in a
>>>>> bytes context we raise a compile time error if any characters are
>>>>> larger than a char?
>>>>
>>>> Can't work because you cannot recover the original byte sequence from a
>>>> decoded Unicode string. It may have used escapes or not, and it may or may
>>>> not be encodable using the source code encoding.
>>>
>>> I'm saying we shouldn't care about using escapes, and should raise a
>>> compile time error if it's not encodable using the source encoding.
>>
>> In that case, you'd break most code that actually uses escapes. If the byte
>> values were correctly representable using the source encoding the escapes
>> wouldn't be necessary in the first place.
>
> The most common escape is probably \n, followed by \0, \r, \t... As
> for \uXXXX, that is just a superset of \xXX that only works for
> unicode literals.
>
>>> In other words, I'm not a fan of
>>>
>>> foo("abc \u0001")
>>>
>>> behaving (in my opinion) very differently depending on whether foo
>>> takes a char* or object argument.
>>
>> It's Python compatible, though:
>
> No, it's not. Python doesn't have the concept of "used in a C context."
>
>> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
>> [GCC 4.4.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> 'abc \u0001'
>> 'abc \\u0001'
>> >>> len('abc \u0001')
>> 10
>> >>> u'abc \u0001'
>> u'abc \x01'
>> >>> len(u'abc \u0001')
>> 5
>>
>> Same for Python 3 with the 'b' prefix on the byte string examples.
>
> When I see b"abc \u0001" or u"abc \u0001" I know exactly what it
> means. When I see "abc \u0001" I have to know whether unicode literals
> are enabled to know what it means, but now you've changed it so that's
> not enough anymore--I have to determine whether it's being used in a
> char* or object context, which I think is something we want to
> minimize.
>
> I'm with Lisandro and Carl Witty--how about just letting the parser
> parse them as unicode literals and then only accepting conversion back
> to char* for plain ASCII rather than introducing more complicated
> logic and semantics?

I don't understand this suggestion. What happens in each of these cases,
for different settings of "from __future__ import unicode_literals"?
cdef char* x1 = 'abc\u0001'
cdef char* x2 = 'abc\x01'
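
For reference, and unless I am misreading it, this is what plain CPython 2.x
itself does with these two literals (an interpreter check in the same style
as Stefan's session above; it only shows the Python-level values, and how
Cython should then coerce each one to char* is exactly what is being debated):

>>> 'abc\u0001'        # plain str literal: \u is not an escape here
'abc\\u0001'
>>> len('abc\u0001')
9
>>> 'abc\x01'          # \x is an escape in both str and unicode literals
'abc\x01'
>>> len('abc\x01')
4
>>> from __future__ import unicode_literals
>>> 'abc\u0001'        # now a unicode literal, so \u0001 is decoded
u'abc\x01'
>>> len('abc\u0001')
4
>>> 'abc\x01'
u'abc\x01'
>>> len('abc\x01')
4

So with unicode_literals in effect the two literals are the same 4-character
unicode string, while without it the first one contains a literal
backslash-u sequence; the question is what each of them should mean when
assigned to a char*.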

Dag Sverre