> - Have a separate mechanism for specifying what encoding should be used 
> for conversion to C buffers. One solution is command-line options; 
> however, this is also a candidate for a Cython language extension, as 
> the "right" answer really depends on what encoding the C library you are 
> calling is using! (char* is basically "encoding-less" in itself.) One 
> might even hard-code it to ASCII or Latin-1 for now.
>   
I don't know much about this, but at least in the Linux world it looks 
like C libraries will usually use the encoding specified in the current 
locale (for instance, if you're on a UTF-8 system, like I am, then 
glibc fopen will expect UTF-8 character data).
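To make the point concrete, here is a small Python sketch of how a process discovers the locale's encoding; it assumes a glibc-style system where the environment (LANG/LC_*) determines the answer:

```python
import locale

# A C library's expected char* encoding typically comes from the locale
# the process has adopted (assumption: a glibc-style Unix system).
try:
    locale.setlocale(locale.LC_ALL, "")   # adopt the environment's locale
except locale.Error:
    pass                                  # environment specifies no usable locale

enc = locale.getpreferredencoding(False)  # e.g. "UTF-8" on a UTF-8 system
print(enc)
```

On a UTF-8 desktop this prints "UTF-8"; on a differently configured runtime system it prints something else, which is exactly the problem described below.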

This can be exemplified by a Cython program printing a text using libc:

# coding: utf-8
cdef extern from "stdio.h":
    int printf(char *format, ...)

def usage():
    printf("Usage: Å\n")

Keep in mind that there are three environments: The system of the Cython 
developer (developer's local workstation), the system for compilation 
(might be a big build-farm on a system with a different encoding), and 
the runtime system (end-user workstation, might have a third encoding).

- What will happen now: The character will be output on screen using the 
encoding of the developer who wrote the Cython program, no matter what 
the encoding is on the compilation system or runtime system.
- What should happen: Whatever is an "Å" should be output on the target 
system, no matter what encoding that system uses.

One solution here is to detect string literals that contain non-ASCII 
characters and always make them into Python unicode objects 
(explicitly using the encoding of the source file upon the construction 
of the unicode object), and call encode (or any Py3 equivalent) at 
runtime (on module load, for instance) to generate the required char* 
buffer for the target system (the build system is then kept out of the 
loop).
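The proposed scheme can be sketched in plain Python (the names and the module-load/call-time split are illustrative, not actual Cython internals):

```python
import locale

# Bytes of "Usage: Å\n" exactly as they appear in a UTF-8 .pyx file.
SOURCE_ENCODING = "utf-8"
strcnst1_bytes_in_source = b"Usage: \xc3\x85\n"

# On module load: decode using the *source file's* encoding, so the
# build system's locale never enters the picture.
strcnst1_unicode = strcnst1_bytes_in_source.decode(SOURCE_ENCODING)

# At runtime: encode using whatever the *target system's* locale says,
# producing the char*-style buffer handed to the C library.
runtime_encoding = locale.getpreferredencoding(False)
char_buffer = strcnst1_unicode.encode(runtime_encoding, errors="replace")
```

The key property is that the literal only exists as a unicode object between the two steps, so developer, build, and runtime systems can all disagree about encodings without corrupting the text.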

I.e. the Cython example above would be compiled to something like this (very pseudo-code):

char* sourcefileencoding = "utf-8";
char strcnst1_bytesbuf[] = { 0x55, 0x73, 0x61, 0x67, 0x65, 0x3a, 0x20, 
0xc3, 0x85, 0x0a, 0x00 };
PyObject* strcnst1_pyobj = PyObjectNewUnicodeWhatever(strcnst1_bytesbuf, 
sourcefileencoding); /* on module load... */

static PyObject* __pyx_..._usage(PyObject* self, PyObject* args) {
  ...
  char* __pyx_1;
  EncodeToCurrentSystemLocale(strcnst1_pyobj, &__pyx_1);
  printf("%s", __pyx_1);
}

...you get the idea. Note that the "Å" ends up as two hex bytes (0xc3 
0x85) in UTF-8. I suppose an alternative would be to standardize on 
UTF-8 in the generated C source files and use Unicode escape sequences 
in strings for all non-ASCII characters, rather than the rather less 
readable hex sequence above.
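A quick Python check confirms both halves of that note (the UTF-8 byte sequence, and the escape-sequence alternative):

```python
# "Å" (U+00C5) encodes to exactly the two bytes 0xC3 0x85 in UTF-8,
# matching the hex literal in the pseudo-code above:
encoded = "Å".encode("utf-8")
assert encoded == b"\xc3\x85"

# The escape-sequence alternative: "\u00c5" denotes the same character
# while keeping the source text itself pure ASCII.
assert "\u00c5" == "Å"
```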

Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev