Carl Witty, 25.08.2010 22:21:
> On Wed, Aug 25, 2010 at 12:15 PM, Stefan Behnel wrote:
>> Lisandro Dalcin, 25.08.2010 20:28:
>>> When trying to cythonize my code using the -3 flag, I got many errors
>>> like the one below:
>>>
>>> Error converting Pyrex file to C:
>>> ------------------------------------------------------------
>>> ...
>>> if not (<int>PetscInitializeCalled): return
>>> if (<int>PetscFinalizeCalled): return
>>> # deinstall custom error handler
>>> ierr = PetscPopErrorHandlerPython()
>>> if ierr != 0:
>>> fprintf(stderr, "PetscPopErrorHandler() failed "
>>> ^
>>> ------------------------------------------------------------
>>>
>>> /u/dalcinl/Devel/petsc4py-dev/src/PETSc/PETSc.pyx:307:24: Unicode
>>> literals do not support coercion to C types other than Py_UNICODE.
>>
>> Right, the parser reads the literal as unicode string here before type
>> analysis figures out that it's really meant to be a bytes literal.
>>
>> This will be hard to change as recovering the original bytes literal is
>> impossible once it's converted to a unicode string (remember that you can
>> use arbitrary character escape sequences in the literal). So I'm leaning
>> towards keeping this as an error. After all, Unicode string literals is one
>> of the things that a user explicitly requests with the -3 switch.
>
> How about allowing it for ASCII literals and leaving it an error if
> there are any codepoints in the literal outside the 0-127 range?

It's not so unlikely that you find C (data) strings that contain (escaped)
non-ASCII characters. Those strings would need a 'b' prefix then. So you'd
end up with some C strings that work without a prefix and others for which
you need a 'b', even if both clearly occur in a C char* context.

The problem is, unprefixed string literals found in source code compiled by
Cython are equally likely to be meant as unicode strings, byte strings, C
strings or polymorphic strings these days. There isn't one obvious "do what
I mean" way.

Remember that Lisandro brought this up because Cython reported an *error*
when compiling the code. I find that a lot better than silently accepting
something that may not have been meant that way.

One thing we could do, however, is to parse all (unprefixed?) strings as
both unicode strings *and* byte strings. That would induce a (minor) bit of
overhead in the parser (both in terms of memory and speed), but it would
allow us to recover the original byte sequence of a Unicode string during
type analysis if we find that we need to coerce it to a byte string. If we
need to, we could then even write both byte sequences into the string
constant table in the C file, so that we can recover the exact byte sequence
and the correct Unicode character sequence depending on the CPython runtime.

Stefan
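
To illustrate the "parse as both" idea, here is a minimal Python sketch. It is not Cython's parser: `parse_both` is a hypothetical helper, it only understands `\xNN` and `\\` escapes, and it assumes a UTF-8 source encoding. It shows why the byte sequence cannot be recovered from the unicode value alone: two different byte strings map to the same unicode string.

```python
import re

# Hypothetical helper for illustration only; handles just \xNN and \\ escapes.
_ESCAPE = re.compile(r'\\(x[0-9a-fA-F]{2}|\\)')


def parse_both(literal_body, source_encoding="utf-8"):
    """Return (unicode_value, bytes_value) for the text between the quotes."""
    uni_parts = []
    byte_parts = []
    pos = 0
    for m in _ESCAPE.finditer(literal_body):
        # Plain source characters: kept as text, or encoded with the
        # (assumed) source encoding for the bytes interpretation.
        plain = literal_body[pos:m.start()]
        uni_parts.append(plain)
        byte_parts.append(plain.encode(source_encoding))
        esc = m.group(1)
        if esc == "\\":
            uni_parts.append("\\")
            byte_parts.append(b"\\")
        else:
            value = int(esc[1:], 16)
            uni_parts.append(chr(value))       # a code point in a unicode literal
            byte_parts.append(bytes([value]))  # a single raw byte in a bytes literal
        pos = m.end()
    tail = literal_body[pos:]
    uni_parts.append(tail)
    byte_parts.append(tail.encode(source_encoding))
    return "".join(uni_parts), b"".join(byte_parts)


# The same text, written once with an escape and once as a literal character.
escaped = parse_both(r"caf\xe9")
literal = parse_both("caf\u00e9")
same_text = escaped[0] == literal[0]    # identical unicode strings
same_bytes = escaped[1] == literal[1]   # but different byte sequences
```

Keeping both values from the start means type analysis could pick the bytes form when a char* is required, instead of trying to re-encode the unicode form and guessing.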
