Dag Sverre Seljebotn, 06.09.2010 20:30:
> Stefan Behnel wrote:
>> Robert Bradshaw, 06.09.2010 19:01:
>>
>>> On Mon, Sep 6, 2010 at 9:36 AM, Dag Sverre Seljebotn
>>>
>>>> I don't understand this suggestion. What happens in each of these cases,
>>>> for different settings of "from __future__ import unicode_literals"?
>>>>
>>>> cdef char* x1 = 'abc\u0001'
>>>>
>>
>> As I said in my other mail, I don't think anyone would use the above in
>> real code. The alternative below is just too obvious and simple.
>>
>>
>>
>>>> cdef char* x2 = 'abc\x01'
>>>>
>>> from __future__ import unicode_literals (or -3)
>>>
>>> len(x1) == 4
>>> len(x2) == 4
>>>
>>> Otherwise
>>>
>>> len(x1) == 9
>>> len(x2) == 4
>>>
>>
>> Hmm, now *that* looks unexpected to me. The way I see it, a C string is the
>> C equivalent of a Python byte string and should always and predictably
>> behave like a Python byte string, regardless of the way Python object
>> literals are handled.
>>
> While the "cdef char*" case isn't that horrible,
>
> f('abc\x01')
>
> is. Imagine throwing in a type in the signature of f and then getting
> different data in.
This case is unambiguous. But the following would change.
# using default source code encoding UTF-8
cdef char* cstring = 'abcüöä'
charfunc('abcüöä')
pyfunc('abcüöä')
Here, 'cstring' is assigned a 9 byte long C string, which is also what gets
passed into charfunc(). With unicode_literals enabled, pyfunc() would receive
u'abcüöä', otherwise it would receive the same 9 byte long byte string.
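Just to make the byte count explicit, here is what plain Python reports for
the UTF-8 encoded form (only meant as an illustration):
assert len(u'abcüöä'.encode('utf-8')) == 9   # 3 ASCII bytes + 3 * 2 bytes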
# encoding: ISO-8859-1
cdef char* cstring = 'abcüöä'
charfunc('abcüöä')
pyfunc('abcüöä')
assigns a 6 byte long C string, and the same goes for the charfunc() call.
With unicode_literals, pyfunc() would receive u'abcüöä', otherwise it would
receive the 6 byte long byte string b'abcüöä'.
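And the Latin-1 counterpart, again just as an illustration:
assert len(u'abcüöä'.encode('iso-8859-1')) == 6   # each umlaut is a single byte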
With the ASCII-only proposal, both examples above would raise an error for
the C string usage and behave as described for the Python strings.
The same string as an escaped literal:
cdef char* cstring = 'abc\xfc\xf6\xe4'
charfunc('abc\xfc\xf6\xe4')
pyfunc('abc\xfc\xf6\xe4')
would assign/pass a 6 byte C string, whereas the ASCII-only proposal would
equally disallow it. The Python case would pass a 6 character unicode string
or a 6 byte long byte string, depending on unicode_literals.
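In plain Python terms (illustration only), the escaped forms give the same
counts:
assert len(b'abc\xfc\xf6\xe4') == 6   # 6 bytes
assert len(u'abc\xfc\xf6\xe4') == 6   # 6 characters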
My point is that I don't see a reason for a compiler error. I find the
above behaviour predictable and reasonable.
> I really, really don't like having the value of a literal depend on type
> of the variable it gets assigned to (I know, I know about ints and so
> on, but let's try to keep the number of instances down).
>
> My vote is for identifying a set of completely safe strings (no \x or
> \u, ASCII-only) that is the same regardless of any setting, and allow
> that. Anything else, demand a b'' prefix to assign to a char*. Putting
> in a b'' isn't THAT hard.
Well, then why not keep it the way it was before and *always* require a 'b'
prefix in front of char* literals when unicode_literals is enabled? After
all, it's an explicit option, so users who want to enable it can be
required to adapt their code accordingly.
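To sketch what that would mean for user code (just an illustration of the
proposed rule, not current behaviour, and the second name is made up):
# with unicode_literals enabled
cdef char* cstring = b'abc\xfc\xf6\xe4'   # explicit bytes literal, accepted
cdef char* broken  = 'abc\xfc\xf6\xe4'    # unprefixed literal, would require b''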
Stefan