> - Have a separate mechanism for specifying what encoding should be used
> for conversion to C buffers. One solution is command-line options;
> however this is also a candidate for a Cython language extension, as
> the "right" answer really depends on what encoding the C library you are
> calling is using! (char* is basically "encoding-less" in itself). One
> might even hard-code it to ASCII or latin1 for now.
>
I don't know much about this, but at least in the Linux world it looks
like C libraries will usually use the encoding specified in the current
locale (for instance, if you're on a UTF-8 system, like I am, then
glibc fopen will expect UTF-8 character data).
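To make the locale dependence concrete, here is a small plain-Python sketch (not Cython) that asks for the encoding implied by the current locale; `locale.getpreferredencoding` is the standard way to query it:

```python
import locale

# Ask which encoding the current locale implies for text I/O.
# On a typical modern Linux desktop this returns "UTF-8"; on other
# systems it may be e.g. "cp1252" or "ISO-8859-1".
enc = locale.getpreferredencoding(False)
print(enc)
```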
This can be exemplified by a Cython program printing a text using libc:

    # coding: utf-8
    # declare libc printf...
    def usage():
        printf("Usage: Å\n")
Keep in mind that there are three environments: The system of the Cython
developer (developer's local workstation), the system for compilation
(might be a big build-farm on a system with a different encoding), and
the runtime system (end-user workstation, might have a third encoding).
- What will happen now: The character will be output on screen using the
encoding of the developer who wrote the Cython program, no matter what
the encoding is on the compilation system or runtime system.
- What should happen: whatever renders as an "Å" should be output on the
target (runtime) system, regardless of its encoding.
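The failure mode above can be reproduced in plain Python (a hypothetical illustration, not Cython output): the literal is baked in as UTF-8 bytes at compile time, and a latin1 runtime system then misreads those same bytes:

```python
# The developer's source file is UTF-8, so "Å" is baked in as two bytes:
baked_in = "Å".encode("utf-8")        # b'\xc3\x85'

# A runtime system whose locale is latin1 interprets those bytes as:
shown = baked_in.decode("latin-1")    # 'Ã' plus a control character,
print(shown)                          # i.e. mojibake, not 'Å'
```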
One solution here is to detect string literals that contain non-ASCII
characters and always turn them into Python unicode objects (constructed
explicitly using the encoding of the source file), then call encode (or
whatever the Py3 equivalent is) at runtime (on module load, for
instance) to generate the required char* buffer for the target system;
the build system is then kept out of the loop.
I.e., the above code would be generated into something like (very rough pseudo-code):
char* sourcefileencoding = "utf-8";
char strcnst1_bytesbuf[] = { 0x55, 0x73, 0x61, 0x67, 0x65, 0x3a, 0x20,
                             0xc3, 0x85, 0x0a, 0x00 };
PyObject* strcnst1_pyobj = PyObjectNewUnicodeWhatever(strcnst1_bytesbuf,
    sourcefileencoding); /* on module load... */

static PyObject* __pyx_..._usage(PyObject* self, PyObject* args) {
    ...
    char* __pyx_1;
    EncodeToCurrentSystemLocale(strcnst1_pyobj, &__pyx_1);
    printf(__pyx_1);
}
...you get the idea. Note that the "Å" ends up as two bytes (0xc3 0x85)
in UTF-8. I suppose an alternative would be to standardize on UTF-8 in
the generated C source files and use Unicode escape sequences in strings
for all non-ASCII characters, rather than the rather less readable hex
sequence above.
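The proposed scheme (decode once with the source file's encoding, then re-encode with the runtime locale's encoding on demand) can be sketched in plain Python; all names here are made up for illustration:

```python
import locale

SOURCE_ENCODING = "utf-8"   # encoding of the Cython source file

# On module load: reconstruct the unicode object from the raw bytes
# the compiler embedded (0xc3 0x85 is "Å" in UTF-8).
strcnst1_bytes = bytes([0x55, 0x73, 0x61, 0x67, 0x65, 0x3a, 0x20,
                        0xc3, 0x85, 0x0a])
strcnst1 = strcnst1_bytes.decode(SOURCE_ENCODING)   # "Usage: Å\n"

def usage():
    # At call time, encode for whatever the *runtime* system uses,
    # keeping the build system out of the loop entirely.
    runtime_encoding = locale.getpreferredencoding(False)
    buf = strcnst1.encode(runtime_encoding, errors="replace")
    print(buf)
```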
Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev