Stefan Behnel, 04.09.2010 17:29:
> Carl Witty, 25.08.2010 22:21:
>> On Wed, Aug 25, 2010 at 12:15 PM, Stefan Behnel wrote:
>>> Lisandro Dalcin, 25.08.2010 20:28:
>>>> When trying to cythonize my code using the -3 flag, I got many errors
>>>> like the one below:
>>>>
>>>> Error converting Pyrex file to C:
>>>> ------------------------------------------------------------
>>>> ...
>>>>        if not (<int>PetscInitializeCalled): return
>>>>        if (<int>PetscFinalizeCalled): return
>>>>        # deinstall custom error handler
>>>>        ierr = PetscPopErrorHandlerPython()
>>>>        if ierr != 0:
>>>>            fprintf(stderr, "PetscPopErrorHandler() failed "
>>>>                           ^
>>>> ------------------------------------------------------------
>>>>
>>>> /u/dalcinl/Devel/petsc4py-dev/src/PETSc/PETSc.pyx:307:24: Unicode
>>>> literals do not support coercion to C types other than Py_UNICODE.
>>>
>>> Right, the parser reads the literal as unicode string here before type
>>> analysis figures out that it's really meant to be a bytes literal.
>>>
>>> This will be hard to change as recovering the original bytes literal is
>>> impossible once it's converted to a unicode string (remember that you can
>>> use arbitrary character escape sequences in the literal). So I'm leaning
>>> towards keeping this as an error. After all, Unicode string literals are
>>> one of the things that a user explicitly requests with the -3 switch.
>>
>> How about allowing it for ASCII literals and leaving it an error if
>> there are any codepoints in the literal outside the 0-127 range?
>
> It's not so unlikely that you find C (data) strings that contain (escaped)
> non-ASCII characters. Those strings would need a 'b' prefix then. So you'd
> end up with some C strings that work without prefix and others for which
> you need a 'b', even if both clearly occur in a C char* context.
>
> The problem is, unprefixed string literals found in source code compiled by
> Cython are equally likely to be meant as unicode strings, byte strings, C
> strings or polymorphic strings these days. There isn't one obvious "do what I
> mean" way. Remember that Lisandro brought this up because Cython reported
> an *error* when compiling the code. I find that a lot better than silently
> accepting something that may not have been meant that way.
>
> One thing we could do, however, is to parse all (unprefixed?) strings as
> both unicode strings *and* byte strings. That would induce a (minor) bit of
> overhead in the parser (both in terms of memory and speed), but it would
> allow us to recover the original byte sequence of a Unicode string during
> type analysis if we find that we need to coerce it to a byte string.

http://trac.cython.org/cython_trac/ticket/575
http://hg.cython.org/cython-devel/rev/a0f2c20789e3
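
To illustrate why the byte sequence has to be kept next to the unicode value rather than reconstructed afterwards, here is a small plain-Python sketch. It is not the code from the changeset above; the helper name read_as_unicode and the example spellings are made up for illustration, and the literals are assumed to be ASCII-only source text.

```python
# Two different source spellings of a literal (the text between the quotes).
spelling_x = r"\xe9"      # hex escape
spelling_u = r"\u00e9"    # unicode escape

def read_as_unicode(text):
    """Hypothetical helper: resolve all escapes, as a unicode literal would."""
    return text.encode("ascii").decode("unicode_escape")

# Both spellings resolve to the same one-character unicode string ...
unicode_x = read_as_unicode(spelling_x)
unicode_u = read_as_unicode(spelling_u)

# ... but read as byte strings they differ: \xe9 is a single byte, while
# \u is not an escape in a byte string and stays as six verbatim characters.
bytes_x = b"\xe9"
bytes_u = spelling_u.encode("ascii")

# Re-encoding the unicode value does not give back either original.
reencoded = unicode_x.encode("utf-8")

print(unicode_x == unicode_u, bytes_x == bytes_u, reencoded)
```

Since the unicode value alone cannot tell the two spellings apart, the byte form has to be recorded at parse time if it is to be available during type analysis.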


> In case we need to, we could then even write both types of byte sequences
> into the string constant table in the C file, so that we can recover the
> exact byte sequence and the correct Unicode character sequence depending on
> the CPython runtime.

Still open. The only obvious use case for this is unicode escapes in
'str' literals, e.g. "abc\u0987". Here, the correct way to read the literal
as a byte string is as a 9-character string that reproduces the escape
sequence verbatim, whereas the correct unicode string is a 4-character
string with the escape sequence resolved. Supporting both readings requires
spelling out both literals in the C code.
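
The two readings can be checked in plain Python. This is only a sketch of the counting argument, not of what Cython generates; the variable names are made up for illustration.

```python
# The text of the literal as it appears between the quotes: abc\u0987
source_text = r"abc\u0987"

# Read as a byte string: \u is not an escape here, so the backslash
# sequence is kept verbatim.
as_bytes = source_text.encode("ascii")

# Read as a unicode string: the \u escape is resolved to one character.
as_unicode = source_text.encode("ascii").decode("unicode_escape")

print(len(as_bytes), len(as_unicode))
```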

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
