Re: [Cython] C string literals

Stefan Behnel Sat, 04 Sep 2010 08:29:38 -0700

Carl Witty, 25.08.2010 22:21:
> On Wed, Aug 25, 2010 at 12:15 PM, Stefan Behnel wrote:
>> Lisandro Dalcin, 25.08.2010 20:28:
>>> When trying to cythonize my code using the -3 flag, I got many errors
>>> like the one below:
>>>
>>> Error converting Pyrex file to C:
>>> ------------------------------------------------------------
>>> ...
>>>       if not (<int>PetscInitializeCalled): return
>>>       if (<int>PetscFinalizeCalled): return
>>>       # deinstall custom error handler
>>>       ierr = PetscPopErrorHandlerPython()
>>>       if ierr != 0:
>>>           fprintf(stderr, "PetscPopErrorHandler() failed "
>>>                          ^
>>> ------------------------------------------------------------
>>>
>>> /u/dalcinl/Devel/petsc4py-dev/src/PETSc/PETSc.pyx:307:24: Unicode
>>> literals do not support coercion to C types other than Py_UNICODE.
>>
>> Right, the parser reads the literal as unicode string here before type
>> analysis figures out that it's really meant to be a bytes literal.
>>
>> This will be hard to change as recovering the original bytes literal is
>> impossible once it's converted to a unicode string (remember that you can
>> use arbitrary character escape sequences in the literal). So I'm leaning
>> towards keeping this as an error. After all, Unicode string literals are one
>> of the things that a user explicitly requests with the -3 switch.
>
> How about allowing it for ASCII literals and leaving it an error if
> there are any codepoints in the literal outside the 0-127 range?


It's not unusual to find C (data) strings that contain (escaped) 
non-ASCII characters. Those strings would then need a 'b' prefix. So you'd 
end up with some C strings that work without a prefix and others for which 
you need a 'b', even if both clearly occur in a C char* context.
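
To illustrate the inconsistency, here is a minimal Python sketch of the proposed ASCII-only rule. The helper name is_ascii_only and the sample literals are assumptions made up for this example, not anything in Cython itself. It shows that a literal with an escaped byte looks like plain ASCII in the source text, fails the 0-127 check once parsed as Unicode, and no longer maps back to a unique byte sequence.

```python
def is_ascii_only(text):
    """Return True if every code point in text lies in the 0-127 range."""
    return all(ord(ch) < 128 for ch in text)


# A message string like the one in Lisandro's report: pure ASCII.
plain = "PetscPopErrorHandler() failed "

# A hypothetical C data string with escaped non-ASCII bytes. Parsed as a
# Unicode literal, the escapes become the code points U+00FF and U+00FE.
escaped = "\xff\xfe data"

plain_ok = is_ascii_only(plain)      # would be accepted without a prefix
escaped_ok = is_ascii_only(escaped)  # would require an explicit 'b' prefix

# The two bytes the author probably meant are only recovered if we guess
# Latin-1; encoding as UTF-8 yields four different bytes instead.
as_latin1 = escaped.encode("latin-1")[:2]
as_utf8 = escaped.encode("utf-8")[:4]

print(plain_ok, escaped_ok, as_latin1, as_utf8)
```

Both literals would appear in the same char* context, yet only the first one would compile without a prefix under that rule.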

The problem is, unprefixed string literals found in source code compiled by 
Cython are equally likely to be meant as unicode strings, byte strings, C 
strings or polymorphic strings these days. There isn't one obvious "do what I 
mean" way. Remember that Lisandro brought this up because Cython reported 
an *error* when compiling the code. I find that a lot better than silently 
accepting something that may not have been meant that way.

One thing we could do, however, is to parse all (unprefixed?) strings as 
both unicode strings *and* byte strings. That would induce a (minor) bit of 
overhead in the parser (both in terms of memory and speed), but it would 
allow us to recover the original byte sequence of a Unicode string during 
type analysis if we find that we need to coerce it to a byte string.
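
A rough sketch of what parsing a literal both ways could look like, written in plain Python. The function parse_literal_both, its source_encoding parameter and the small escape table are hypothetical and only cover \xHH plus a few simple escapes; this is not Cython's actual parser, just an illustration of keeping both values side by side.

```python
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"', "'": "'"}


def parse_literal_both(body, source_encoding="utf-8"):
    """Parse the raw text between the quotes of an unprefixed literal.

    Returns (unicode_value, bytes_value). Only \\xHH and a few simple
    escapes are handled; anything else is copied through unchanged.
    """
    chars = []         # the literal read as a Unicode string
    raw = bytearray()  # the same literal read as a byte string
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt == "x" and i + 3 < len(body):
                value = int(body[i + 2:i + 4], 16)
                chars.append(chr(value))  # code point U+00HH
                raw.append(value)         # the single byte 0xHH
                i += 4
                continue
            if nxt in SIMPLE_ESCAPES:
                chars.append(SIMPLE_ESCAPES[nxt])
                raw.extend(SIMPLE_ESCAPES[nxt].encode("ascii"))
                i += 2
                continue
        # Ordinary character: keep it as text, and store the bytes it
        # occupied in the source file for the byte-string reading.
        chars.append(ch)
        raw.extend(ch.encode(source_encoding))
        i += 1
    return "".join(chars), bytes(raw)


text_value, byte_value = parse_literal_both(r"caf\xe9 \n")
print(repr(text_value), repr(byte_value))
```

With both values attached to the literal, type analysis could pick byte_value when it sees a char* context, instead of trying to re-encode text_value and guessing the encoding.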

If we need to, we could then even write both representations (the raw 
byte sequence and the encoded Unicode string) into the string constant 
table in the C file, so that we can recover the exact byte sequence or the 
correct Unicode character sequence, depending on the CPython runtime.

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
