> "In the face of ambiguity, refuse the temptation to guess." :)
>
> Somehow "inferring" the difference between str and unicode literals is the
> wrong thing to do.
>
I don't think I explained my question well enough; I'll try again.
The thing is, this kind of inference already happens; you can do

    cdef char c = "c"

and the string literal "c" becomes a single character value, while you
can do

    cdef char* s = "hello"

and you get a C string literal (which is passed straight through from
Cython source), while

    py_s = "hello"
gives a Python object. Somehow the "natural" thing to do for Py3 is to
continue allowing "direct" assignments to char* of the type above; but
generate unicode objects on coercion to Python object. (Hmm. So the
problem is that one can no longer auto-coerce from Python string objects
to char*...)
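For what it's worth, the contrast can be mimicked in plain Python; this is only an illustration of the semantics, not Cython itself:

```python
# Plain-Python illustration of the three interpretations Cython infers
# for the same-looking literal, depending on the declared target type.

c = ord("c")      # cdef char c = "c"      -> a single character value
s = b"hello"      # cdef char* s = "hello" -> raw byte data
py_s = "hello"    # py_s = "hello"         -> a Python string object

print(c, s, py_s)
```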
Hmm. This might come from a wrong understanding of the problem, but from
my limited knowledge it looks like we get this problem because the
current Cython behaviour is wrong, even in a Python 2.6 context.
Suggestion:
- Support PEP 263 as you say. This is for *input* from Cython source
*only*; the whole point is that whether you edit your source files on a
UTF-8 or BIG-5 system shouldn't impact anything about runtime behaviour
as long as you declare the encoding of the source file.
- Have a separate mechanism for specifying what encoding should be used
for conversion to C buffers. One solution is command-line options;
however, this is also a candidate for a Cython language extension, as
the "right" answer really depends on what encoding the C library you are
calling uses! (char* is basically "encoding-less" in itself.) One might
even hard-code it to ASCII or Latin-1 for now.
- String literals assigned to buffers (cdef char* s = "hello") are
reencoded during Cython compilation into the right target encoding, so
that if Latin-1 is specified for the C library in question, I get
correct results even when editing the Cython source in UTF-8. In fact,
for maximum portability of the generated C source, one can emit the
literal as-is if only ASCII is used, and otherwise generate something
like

    char s[] = {-20, 54, 50, 0};

If there's a mismatch between input and output encoding (I declared the
C library I'm calling as ASCII but try to use my native "øåæÅØ"), then
it's a compile-time error.
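A rough sketch of that reencoding step in plain Python (the function name and the signed-char rendering are mine, just to show the idea):

```python
def literal_to_c_array(literal, target_encoding):
    """Reencode a string literal (already decoded from the source file
    per its PEP 263 declaration) into the C library's target encoding,
    rendered as a signed-char C array initializer.  A UnicodeEncodeError
    here plays the role of the compile-time error described above."""
    data = literal.encode(target_encoding)
    # C's char is signed on most platforms, hence the -256 adjustment.
    signed = [b - 256 if b > 127 else b for b in data]
    return "char s[] = {%s};" % ", ".join(str(v) for v in signed + [0])

print(literal_to_c_array("hello", "latin-1"))
# -> char s[] = {104, 101, 108, 108, 111, 0};
print(literal_to_c_array("ø", "latin-1"))
# -> char s[] = {-8, 0};
```

Calling literal_to_c_array("øåæÅØ", "ascii") raises UnicodeEncodeError, which is exactly the input/output mismatch that should become a compile-time error.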
- On coercions from Python strings (unicode or otherwise) to char*, the
same reencoding is used (call s.encode(ENCODING) or similar). This will
raise the appropriate exceptions.
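The runtime side is just the standard encode machinery; roughly (ENCODING standing in for the per-library setting proposed above):

```python
ENCODING = "ascii"  # hypothetical per-library setting from the proposal

def coerce_to_char_buffer(py_string):
    # What generated coercion code might do: encode the Python string,
    # then hand the resulting byte buffer to C.  Failures surface as
    # the usual UnicodeEncodeError at runtime.
    return py_string.encode(ENCODING)

coerce_to_char_buffer("hello")        # fine
try:
    coerce_to_char_buffer("øåæÅØ")    # not representable in ASCII
except UnicodeEncodeError as exc:
    print("cannot pass to this C library:", exc.reason)
```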
It would be good to solve this anyway; I fail to see the connection
with Python 3, and I definitely don't think that Cython behaviour needs
to differ between the two (even if everything is unicode in Python 3,
there should be functionality somewhere in the library to generate byte
data in other encodings?)
Dag Sverre
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev