Dag Sverre Seljebotn, 06.09.2010 20:30:
> Stefan Behnel wrote:
>> Robert Bradshaw, 06.09.2010 19:01:
>>
>>> On Mon, Sep 6, 2010 at 9:36 AM, Dag Sverre Seljebotn
>>>
>>>> I don't understand this suggestion. What happens in each of these cases,
>>>> for different settings of "from __future__ import unicode_literals"?
>>>>
>>>> cdef char* x1 = 'abc\u0001'
>>>>
>>
>> As I said in my other mail, I don't think anyone would use the above in
>> real code. The alternative below is just too obvious and simple.
>>
>>
>>
>>>> cdef char* x2 = 'abc\x01'
>>>>
>>> from __future__ import unicode_literals (or -3)
>>>
>>> len(x1) == 4
>>> len(x2) == 4
>>>
>>> Otherwise
>>>
>>> len(x1) == 9
>>> len(x2) == 4
>>>
>>
>> Hmm, now *that* looks unexpected to me. The way I see it, a C string is the
>> C equivalent of a Python byte string and should always and predictably
>> behave like a Python byte string, regardless of the way Python object
>> literals are handled.
>>
> While the "cdef char*" case isn't that horrible,
>
> f('abc\x01')
>
> is. Imagine throwing in a type in the signature of f and then getting
> different data in.
This case is unambiguous. But the following would change.
# using default source code encoding UTF-8
cdef char* cstring = 'abcüöä'
charfunc('abcüöä')
pyfunc('abcüöä')
Here, 'cstring' is assigned a 9 byte long C string, which is also what gets
passed into charfunc(). With unicode_literals enabled, pyfunc() would receive
u'abcüöä', otherwise it would receive the same 9 byte long byte string.
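Just to make the byte count explicit, here is what plain Python reports for
the UTF-8 encoded form (only meant as an illustration):
assert len(u'abcüöä'.encode('utf-8')) == 9   # 3 ASCII bytes + 3 * 2 bytes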
# encoding: ISO-8859-1
cdef char* cstring = 'abcüöä'
charfunc('abcüöä')
pyfunc('abcüöä')
assigns a 6 byte long C string, and the same goes for the charfunc() call.
With unicode_literals, pyfunc() would receive u'abcüöä', otherwise it would
receive the 6 byte long byte string b'abcüöä'.
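And the Latin-1 counterpart, again just as an illustration:
assert len(u'abcüöä'.encode('iso-8859-1')) == 6   # each umlaut is a single byte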
With the ASCII-only proposal, both examples above would raise an error for
the C string usage and behave as described for the Python strings.
The same string as an escaped literal:
cdef char* cstring = 'abc\xfc\xf6\xe4'
charfunc('abc\xfc\xf6\xe4')
pyfunc('abc\xfc\xf6\xe4')
would assign/pass a 6 byte C string, whereas the ASCII-only proposal would
equally disallow it. The Python case would pass a 6 character unicode string
or a 6 byte long byte string, depending on unicode_literals.
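In plain Python terms (illustration only), the escaped forms give the same
counts:
assert len(b'abc\xfc\xf6\xe4') == 6   # 6 bytes
assert len(u'abc\xfc\xf6\xe4') == 6   # 6 characters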
My point is that I don't see a reason for a compiler error. I find the
above behaviour predictable and reasonable.
> I really, really don't like having the value of a literal depend on type
> of the variable it gets assigned to (I know, I know about ints and so
> on, but let's try to keep the number of instances down).
>
> My vote is for identifying a set of completely safe strings (no \x or
> \u, ASCII-only) that is the same regardless of any setting, and allow
> that. Anything else, demand a b'' prefix to assign to a char*. Putting
> in a b'' isn't THAT hard.
Well, then why not keep it the way it was before and *always* require a 'b'
prefix in front of char* literals when unicode_literals is enabled? After
all, it's an explicit option, so users who want to enable it can be
required to adapt their code accordingly.
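To sketch what that would mean for user code (just an illustration of the
proposed rule, not current behaviour, and the second name is made up):
# with unicode_literals enabled
cdef char* cstring = b'abc\xfc\xf6\xe4'   # explicit bytes literal, accepted
cdef char* broken  = 'abc\xfc\xf6\xe4'    # unprefixed literal, would require b''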
Stefan