On Sat, Sep 4, 2010 at 8:29 AM, Stefan Behnel <[email protected]> wrote:
> Carl Witty, 25.08.2010 22:21:
>> On Wed, Aug 25, 2010 at 12:15 PM, Stefan Behnel wrote:
>>> Lisandro Dalcin, 25.08.2010 20:28:
>>>> When trying to cythonize my code using the -3 flag, I got many errors
>>>> like the one below:
>>>>
>>>> Error converting Pyrex file to C:
>>>> ------------------------------------------------------------
>>>> ...
>>>>     if not (<int>PetscInitializeCalled): return
>>>>     if (<int>PetscFinalizeCalled): return
>>>>     # deinstall custom error handler
>>>>     ierr = PetscPopErrorHandlerPython()
>>>>     if ierr != 0:
>>>>         fprintf(stderr, "PetscPopErrorHandler() failed "
>>>>                         ^
>>>> ------------------------------------------------------------
>>>>
>>>> /u/dalcinl/Devel/petsc4py-dev/src/PETSc/PETSc.pyx:307:24: Unicode
>>>> literals do not support coercion to C types other than Py_UNICODE.
>>>
>>> Right, the parser reads the literal as unicode string here before type
>>> analysis figures out that it's really meant to be a bytes literal.
>>>
>>> This will be hard to change as recovering the original bytes literal is
>>> impossible once it's converted to a unicode string (remember that you can
>>> use arbitrary character escape sequences in the literal). So I'm leaning
>>> towards keeping this as an error. After all, Unicode string literals is one
>>> of the things that a user explicitly requests with the -3 switch.
>>
>> How about allowing it for ASCII literals and leaving it an error if
>> there are any codepoints in the literal outside the 0-127 range?
>
> It's not so unlikely that you find C (data) strings that contain (escaped)
> non-ASCII characters. Those strings would need a 'b' prefix then. So you'd
> end up with some C strings that work without prefix and others for which
> you need a 'b', even if both clearly occur in a C char* context.

In my experience, non-ASCII literals are even more uncommon than
non-ASCII user data, but it would be really nice at least to handle
the ASCII case smoothly.

> The problem is, unprefixed string literals found in source code compiled by
> Cython are equally likely to be meant as unicode strings, byte strings, C
> strings or polymorphic strings these days. There isn't one obvious "do what I
> mean" way. Remember that Lisandro brought this up because Cython reported
> an *error* when compiling the code. I find that a lot better than silently
> accepting something that may not have been meant that way.
>
> One thing we could do, however, is to parse all (unprefixed?) strings as
> both unicode strings *and* byte strings. That would induce a (minor) bit of
> overhead in the parser (both in terms of memory and speed), but it would
> allow us to recover the original byte sequence of a Unicode string during
> type analysis if we find that we need to coerce it to a byte string.
>
> In case we need to, we could then even write both types of byte sequences
> into the string constant table in the C file, so that we can recover the
> exact byte sequence and the correct Unicode character sequence depending on
> the CPython runtime.

How about we parse the literals as unicode strings, and if one is used
in a bytes context we raise a compile-time error if any character is
larger than a char? Thus "\u0001" would still be OK in a bytes
context, but "\u1000" would not be (compile-time error). It may even
be better to set the limit to 127, as that is the truly unambiguous
range, and require a prefix if you really want something more.

- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
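
To make the proposal above concrete, here is a rough Python sketch of the check, not actual Cython compiler code. The names `LiteralCoercionError` and `coerce_unicode_literal_to_bytes` are hypothetical, made up for illustration, and the sketch assumes the literal has already been parsed into a unicode string with its escapes resolved.

```python
class LiteralCoercionError(Exception):
    """Hypothetical stand-in for a compile-time error."""


def coerce_unicode_literal_to_bytes(text, limit=127):
    """Return the bytes for an already-parsed unicode literal.

    limit=127 accepts only the unambiguous ASCII range;
    limit=255 accepts anything that fits in a single char.
    """
    if not 0 <= limit <= 255:
        raise ValueError("limit must be between 0 and 255")
    for index, char in enumerate(text):
        if ord(char) > limit:
            # Above the limit: refuse and ask for an explicit prefix.
            raise LiteralCoercionError(
                "character %r at position %d is above %d; "
                "use a b'' prefix" % (char, index, limit))
    # Code points 0-255 map one-to-one onto byte values in Latin-1.
    return text.encode("latin-1")
```

With the default limit, "\u0001" coerces to a one-byte string and "\u1000" is rejected. Passing limit=255 would correspond to the looser "larger than a char" variant.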
