On Nov 27, 2009, at 10:52 PM, Stefan Behnel wrote:

> Hi Robert,
>
> Robert Bradshaw, 27.11.2009 22:34:
>> I had an epiphany when I realized that I find this burdensome not
>> because the user needs to specify an encoding, but because they have
>> to manually handle it every time they deal with a char*. So, my
>> proposal is this: let the user specify via a compiler directive an
>> encoding to use for all conversions.
>
> Sounds better than defaulting to the Python system encoding (as Py2
> does), which is unrelated to the encoding used by any C libraries
> etc. It's also explicit.
>
> On the downside, while being explicit, it can still lead to all sorts
> of unexpected behaviour for users because strings would pop up in
> non-obvious types in their code.

I'm not following you here.

> Now the conversion from char* to bytes would have to be explicit,
> although it's certainly not uncommon when dealing with C code, and
> totally normal in Py2.

Yes, though only when the directive is in place.

>
>> Cython could then transparently and
>> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
>> Py3, and unicode -> char* in Py2.
>
> As Greg pointed out, going directly from unicode to char* isn't
> trivial to implement, and the implications are certainly not obvious
> for most users and not controllable by user code, so you can't just
> free memory by setting a variable to None. I think that's straight
> out for not being explicit.

We might have to limit ourselves to the system default encoding, as
there is a slot for that in the unicode object for just this purpose.
At least it's always UTF-8 in Py3.
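To make the memory issue concrete, the naive conversion without such a
slot would look like this (a sketch; naive_as_charp is a hypothetical
helper name):

cdef char* naive_as_charp(unicode u):
    cdef bytes tmp = u.encode('UTF-8')
    return tmp  # BUG: tmp is freed when the function returns, so
                # the caller gets a pointer into dead memory

The slot sidesteps this by caching the encoded bytes inside the
unicode object itself, so the char* stays valid as long as the
unicode object does.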

> Currently, coercion from char*/bytes to unicode is an explicit step
> that is easy to do via
>
>    cdef char* s = ...
>    u = s[:length].decode('UTF-8')
>
> in 0.12. See
>
> http://trac.cython.org/cython_trac/ticket/436

That is an improvement, though still a lot more baggage than

cdef char* s = ...
u = s

> Your proposal would make that
>
>    # cython: bytes-encoding=UTF-8
>
>    cdef char* s = ...
>    cdef unicode u = s[:length]
>
> (well, I /hope/ you'd require the target to be typed, right?)
>
> or
>
>    # cython: bytes-encoding=UTF-8
>
>    cdef char* s = ...
>    cdef str py_s = s[:length]
>
> so you'd not really gain much in terms of typing and would (IMO) lose
> readability.

No, I wasn't thinking of requiring a typed target. That way you could do

cdef extern from "fooey.h":
    char* foo_c(char*)

def foo(char* s):
    return foo_c(s)

That seems like a big saving in terms of both readability and typing,
especially when the encoding step is completely orthogonal to the
issue at hand. (It's also the obvious thing for a new user to try, it
maintains compatibility with Pyrex, and it behaves gracefully in Py3.)
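For comparison, here is roughly the explicit version one has to write
today, i.e. the boilerplate the directive would generate (a sketch:
fooey.h/foo_c are from the example above, the UTF-8 choice is
illustrative, and strlen just recovers the length for the sliced
decode idiom from 0.12):

cdef extern from "string.h":
    size_t strlen(char*)

cdef extern from "fooey.h":
    char* foo_c(char*)

def foo(s):
    # Encode manually, keeping the bytes object alive for the call;
    # this is exactly the step the directive would hide.
    cdef bytes s_bytes = s.encode('UTF-8') if isinstance(s, unicode) else s
    cdef char* result = foo_c(s_bytes)
    if result == NULL:
        return None
    return result[:strlen(result)].decode('UTF-8')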

> Note that many encodings (e.g. the Asian 2-byte encodings) naturally
> contain 0 bytes, so automatic conversion of char* can't even work in
> those cases, as only the user code would know the correct length of
> the string.

Sure. One would have to manually specify the length in that case, if
we even allowed such encodings as defaults. This isn't useful for
everyone, but a huge audience is scientific users whose "strings"
(algorithm names, flags, or sensor IDs that are unlikely to contain
non-ASCII characters) get passed to legacy Fortran and C libraries,
and who aren't doing "text processing" in any meaningful way.
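For instance (a sketch; legacy_solver.h and solver_c are hypothetical,
but LAPACK-style routines take exactly this kind of short ASCII flag
argument):

cdef extern from "legacy_solver.h":
    int solver_c(char* method, char* sensor_id, double tol)

def solve(method, sensor_id, double tol=1e-8):
    # Today these coercions require the arguments to already be
    # bytes/str in Py2; under the proposed directive they would also
    # accept unicode (and hence work unchanged in Py3).
    cdef char* m = method
    cdef char* sid = sensor_id
    return solver_c(m, sid, tol)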

> I'm +0.3 on the opposite way in Py2 for the 'str' type, though, as I
> already mentioned. I think that would a) fit the intention of users,
> b) match the main use case of accepting both str and unicode as
> function arguments in Py2 (and only in Py2!), and c) be free of
> memory handling issues as the target would still be a Python object.
>
> So I think it makes sense to support this only in Py2, and only for
> Python objects, not for char*.

So you're suggesting

def foo(str s):
    cdef char* ss = s

would work fine in Py2 on unicode input, but not in Py3? And this is
not a shortcut for

def foo(char* ss):
    ...

which would not be supported even if there were a directive in place
to use the system default encoding?

> BTW, you keep talking about supporting all sorts of encodings here,
> whereas the use cases you present seem to deal only with plain ASCII
> non-textual data. Maybe it would be enough to make ASCII the default
> encoding for unicode->str coercion of function arguments in Py2 then?
> Or (as my original proposal went) to use the platform encoding for
> this, as CPython does, which is normally ASCII in Py2 anyway.

I was just going for maximum flexibility. (A "default encoding"
directive seems to naturally take a parameter.) My use cases are
typically more limited. (Not that I've never dealt with non-Latin
characters, but when I'm doing actual text processing, having to
think about encodings at every turn feels natural.) Supporting at
least UTF-8 would be ideal, as then all unicode objects could survive
the round trip (and it would work well with any C library that
expects null-terminated strings), but I think we're going to be
constrained to the default encoding due to memory issues, at least
for any hope of unicode -> char*.
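To make the round-trip point concrete (a sketch in today's explicit
syntax; under the proposed directive the encode/decode steps would be
implicit):

def round_trip(unicode u):
    # UTF-8 can encode any unicode string, so encoding to a char*
    # buffer and decoding the same bytes recovers the original.
    # A C library treating the pointer as null-terminated would
    # still truncate at an embedded NUL, hence the caveat above.
    cdef bytes b = u.encode('UTF-8')
    cdef char* s = b                    # borrows from b, which stays alive
    return s[:len(b)].decode('UTF-8')   # == u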

- Robert
