Robert Bradshaw, 14.02.2013 06:51:
> I've proposed having a compiler
> directive that lets you specify an encoding (e.g. ascii, utf8) and
> automatically encodes/decodes when converting between C and Python
> strings.
My main objection against that is that it would only work in one direction,
from C strings to Python strings. The other direction requires an explicit
intermediate bytes object in order to correctly do the memory management,
so there's really nothing to win there. Doing anything implicit in that
direction would just call for either trouble or inefficiency.
For the first direction, C-to-Python, I don't see a major advantage of the
implicit
    cdef unicode py_string = c_string   # typing required here
over the explicit
    py_string = c_string.decode('utf-8')   # note: no typing here
There is only one case where it's a bit simpler:
    py_string = c_string[:length]   # no typing, auto-coercion
in contrast to
    py_string = c_string[:length].decode('utf-8')
Either way, the difference is just a couple of characters, and those are best
hidden in an explicit "conversion + validation" function. Auto-coercion of
C strings will always be more inefficient and error-prone than users should
be asked to bear, and all we could add would be the unidirectional
conversion part, not any validation or whatever else user code has to do in
addition.
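To make the "conversion + validation" point concrete, here is a minimal sketch of such a helper. It is written in plain Python so that it runs as is; the name decode_c_buffer and its parameters are hypothetical, and the bytes argument merely stands in for a C char* plus an optional length. It is an illustration of the idea, not code from Cython itself.

```python
def decode_c_buffer(data, length=None, encoding="utf-8"):
    """Decode a byte buffer to text, failing loudly on bad input.

    Hypothetical helper: `data` stands in for a C `char*`, `length`
    for an explicit byte count as in `c_string[:length]`.
    """
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("expected a byte buffer, got %s" % type(data).__name__)
    if length is not None:
        # Validation step: reject lengths that do not fit the buffer.
        if length < 0 or length > len(data):
            raise ValueError("length out of range")
        data = data[:length]
    # Conversion step: strict decoding raises UnicodeDecodeError on
    # invalid or truncated byte sequences instead of guessing.
    return bytes(data).decode(encoding, "strict")
```

The decode call itself is the short part; the checks around it are what user code has to supply in any case, and an implicit coercion could not add them.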
The situation is entirely different for C++ strings. They have an efficient
two-way auto-coercion and safely copy their content on creation. In their
case, auto-coercion would basically behave like
    from __future__ import unicode_literals
but for string coercion. I have no objections against that. I think it just
needs implementing and then testing against a couple of real, existing code
bases to see what the real-world tradeoff is then. It's just a matter of
whether a user needs to write "<unicode>" or "<bytes>" in the right places.
All of that being said, the proposal sounds like it's actually two: 1)
specify an implicit encoding for coercion between C++ strings and Python
unicode strings, and 2) automatically coerce between C++ strings and Python
unicode strings by default. 1) means that
    cdef libcpp.string cs1 = ..., cs2
    py_string = <unicode>cs1
    cs2 = py_string
would auto-decode and -encode the string, 2) means that
    cdef libcpp.string cs1 = ..., cs2
    py_string = <object>cs1
    cs2 = py_string
would do it (including any implicit coercions to Python objects). If 2) is
desirable at all, I think it makes sense to split this into two separate
directives, as many users will be better off without the second one.
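The difference between 1) and 2) can be modelled in a few lines of plain Python. This is only a sketch under assumptions: bytes stands in for a C++ string, and the function names and the two parameters (encoding for directive 1, auto_unicode for directive 2) are hypothetical, not existing Cython directives. Raising TypeError when no encoding is configured is likewise an assumption about how the explicit cast would behave.

```python
def coerce_cpp_string(raw, target, encoding=None, auto_unicode=False):
    """Model coercion of a C++ string (here: bytes) to a Python object.

    target is 'bytes', 'unicode' or 'object', mirroring the casts
    <bytes>, <unicode> and <object> in the examples above.
    """
    if target == "bytes":
        return raw
    if target == "unicode":
        # Directive 1: an explicit <unicode> cast uses the configured encoding.
        if encoding is None:
            raise TypeError("no implicit encoding configured")
        return raw.decode(encoding)
    if target == "object":
        # Directive 2: only with auto_unicode does a plain object
        # coercion decode; otherwise it keeps producing bytes.
        if auto_unicode and encoding is not None:
            return raw.decode(encoding)
        return raw
    raise ValueError("unknown target: %r" % (target,))


def coerce_to_cpp_string(value, encoding=None):
    """Model the reverse direction: Python object to C++ string (bytes)."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        if encoding is None:
            raise TypeError("no implicit encoding configured")
        return value.encode(encoding)
    raise TypeError("expected bytes or str")
```

With only encoding set, code that never writes <unicode> keeps its bytes behaviour, which is why keeping the second switch separate leaves existing users unaffected.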
There's also the question of whether you want coercion to and from "unicode"
or to and from "str". Getting the latter right wouldn't be easy, either
for us or for users who want to apply it to their code.
However, given that the only use case for that would be Py2 backwards
compatibility, waiting a couple of years longer should nicely solve this
problem for us. No need to burden the compiler with it now.
Stefan