Re: [Cython] [cython-users] Re: Python 3 and string frustration

Stefan Behnel Wed, 13 Feb 2013 23:05:37 -0800

Robert Bradshaw, 14.02.2013 06:51:
> I've proposed having a compiler
> directive that lets you specify an encoding (e.g. ascii, utf8) and
> automatically encodes/decodes when converting between C and Python
> strings.


My main objection to that is that it would only work in one direction,
from C strings to Python strings. The other direction requires an explicit
intermediate bytes object in order to correctly do the memory management,
so there's really nothing to win there. Doing anything implicit in that
direction would just call for either trouble or inefficiency.

For the first direction, C-to-Python, I don't see a major advantage of
the implicit

    cdef unicode py_string = c_string      # typing required here

over the explicit

    py_string = c_string.decode('utf-8')   # note: no typing here

There is only one case where it's a bit simpler:

    py_string = c_string[:length]          # no typing, auto-coercion

in contrast to

    py_string = c_string[:length].decode('utf-8')

Either way, it's just a difference of a couple of characters, and those are
best hidden in an explicit "conversion + validation" function anyway.
Auto-coercion of C strings will always be more inefficient and error-prone
than users should be asked to bear, and all we could add would be the
unidirectional conversion part, not any validation or whatever else user
code has to do in addition.
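
As a rough sketch of what such an explicit "conversion + validation"
function might look like (plain Python here so that it runs without
compiling; the name decode_c_text and the NUL check are made up for
illustration and are not anything Cython provides):

```python
def decode_c_text(data, length=None, encoding='utf-8'):
    """Hypothetical helper: decode a byte buffer into a unicode string.

    'data' stands in for the content of a char*. In Cython, the slicing
    step would be written as c_string[:length].
    """
    if length is not None:
        if length < 0 or length > len(data):
            raise ValueError("length out of range")
        data = data[:length]
    # Example validation step (an assumption, adapt it to your data):
    # reject embedded NUL bytes.
    if b'\x00' in data:
        raise ValueError("embedded NUL byte in string data")
    # Strict decoding raises UnicodeDecodeError on invalid input.
    return data.decode(encoding)
```

The point is only that the ".decode()" call ends up in one place, next to
whatever checks the user code needs anyway.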


The situation is entirely different for C++ strings. They have an efficient
two-way auto-coercion and safely copy their content on creation. In their
case, auto-coercion would basically behave like

    from __future__ import unicode_literals

but for string coercion. I have no objections against that. I think it just
needs implementing and then testing against a couple of real, existing code
bases to see what the real-world tradeoff is then. It's just a matter of
whether a user needs to write "<unicode>" or "<bytes>" in the right places.


All of that being said, the proposal sounds like it's actually two: 1)
specify an implicit encoding for coercion between C++ strings and Python
unicode strings, and 2) automatically coerce between C++ strings and Python
unicode strings by default. 1) means that

    cdef libcpp.string cs1 = ..., cs2

    py_string = <unicode>cs1
    cs2 = py_string

would auto-decode and -encode the string, 2) means that

    cdef libcpp.string cs1 = ..., cs2

    py_string = <object>cs1
    cs2 = py_string

would do it (including any implicit coercions to Python objects). If 2) is
desirable at all, I think it makes sense to split this into two separate
directives, as many users will be better off without the second one.


There's also the question of whether you want coercion to and from "unicode"
or to and from "str". Getting the latter right wouldn't be easy, most
likely neither for us nor for users who want to apply it to their code.
However, given that the only use case for that would be Py2 backwards
compatibility, waiting a couple of years longer should nicely solve this
problem for us. No need to burden the compiler with it now.
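
To illustrate why "str" is the awkward target: the same C data has to stay
a byte string under Py2 but become decoded text under Py3. A minimal sketch
in plain Python, with a hypothetical helper name and UTF-8 assumed as the
default encoding:

```python
import sys

def to_native_str(data, encoding='utf-8'):
    """Hypothetical helper: return 'data' (bytes) as the native str type."""
    if sys.version_info[0] < 3:
        # Py2: str is the byte string type, so pass the data through as-is.
        return data
    # Py3: str is the unicode type, so the bytes must be decoded.
    return data.decode(encoding)
```

The return value changes type and semantics between the two versions,
which is the part that would be hard to get right implicitly.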

Stefan

_______________________________________________
cython-devel mailing list
[email protected]
http://mail.python.org/mailman/listinfo/cython-devel
