Re: [Cython] Another string encoding idea

Stefan Behnel Sat, 28 Nov 2009 07:02:20 -0800

Robert Bradshaw, 28.11.2009 12:02:
> On Nov 27, 2009, at 10:52 PM, Stefan Behnel wrote:
>> On the downside, while being explicit, it can still lead to all sorts of
>> unexpected behaviour for users because strings would pop up in
>> non-obvious types in their code.
> 
> I'm not following you here.


Your main point is that people shouldn't be bothered with bytes/str/unicode
issues if they just deal with 'text'. This means that the goal is to make
it easy to forget that they exist. Having strings turn into unicode
automatically will therefore easily lead to problems when dealing with byte
strings, as it's just as easy to forget to decode a byte string into
unicode as it is to forget to cast a byte string to <bytes> to make sure it
does *not* become a unicode string. Depending on the code, this may or may
not be an issue, but it's something we are about to introduce here.
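
To illustrate the kind of mix-up meant here, a minimal sketch in plain
Python 3 (not Cython; the names `raw` and `text` are made up for the
example). A byte string that was never decoded simply does not compare equal
to its text counterpart, and by default nothing warns about it:

```python
# Illustrative only: plain Python 3 with hypothetical values.
raw = b"caf\xc3\xa9"   # UTF-8 encoded bytes, e.g. as received from C code
text = "caf\u00e9"     # the same content as a unicode string

# Forgetting to decode goes unnoticed: the comparison is just False.
forgot_to_decode = (raw == text)

# The explicit decoding step makes both sides the same type.
decoded_matches = (raw.decode("UTF-8") == text)

# Mixing the two types in an operation only fails at runtime.
try:
    raw + text
    mixing_failed = False
except TypeError:
    mixing_failed = True
```

The same hazard exists in both directions, which is the point above: a
forgotten decode and a forgotten cast to bytes look equally harmless in the
source.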


>> Now the conversion from char* to bytes would have to
>> be explicit, although it's certainly not uncommon when dealing with  
>> C code, and totally normal in Py2.
> 
> Yes, though only when the directive is in place.

And then the problem is that users will not use that directive from the
start, and will have to fix their byte strings when they realise that it
exists.

Plus, byte strings are a lot faster and more memory efficient in Py2 than
unicode strings. Enabling such a directive means that all char*->string
coercions will drop the string into unicode, unless users specifically cast
them back.
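
As a rough illustration of the memory point, assuming the Py2 unicode
representation of 2 or 4 bytes per character depending on the build (UCS2
vs. UCS4) and ignoring object headers, the payload sizes can be compared in
plain Python. The sample string is arbitrary:

```python
# Payload sizes only; object overhead is deliberately ignored.
sample = "hello world"

bytes_payload = len(sample.encode("ascii"))     # 1 byte per character
ucs2_payload = len(sample.encode("utf-16-le"))  # 2 bytes per character
ucs4_payload = len(sample.encode("utf-32-le"))  # 4 bytes per character
```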


>>> Cython could then transparently and
>>> efficiently handle all char* <-> str (a.k.a. unicode) encodings in
>>> Py3, and unicode -> char* in Py2.
>> As Greg pointed out, going directly from unicode to char* isn't
>> trivial to implement and the implications are certainly not obvious for
>> most users and not controllable by user code, so you can't just free
>> memory by setting a variable to None. I think that's straight out for
>> not being explicit.
> 
> We might have to limit ourselves to the system default encoding, as  
> there is a slot for that in the unicode object for just this purpose.  

... which, just to repeat it, is deprecated and may thus go away without
further warning. How would you handle this case then?


>> Currently, coercion from char*/bytes to unicode is an explicit step  
>> that is easy to do via
>>
>>    cdef char* s = ...
>>    u = s[:length].decode('UTF-8')
>>
>> in 0.12. See
>>
>> http://trac.cython.org/cython_trac/ticket/436
> 
> That is an improvement, though still a lot more baggage than
> 
> cdef char* s = ...
> u = s

I just extended it to also speed up

    u = s.decode(enc)

which users can deploy if the encoding supports it. I think that's easy
enough for going from bytes/char* to unicode, which I really think is worth
being an explicit step.
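
A plain Python sketch of the two forms, under the assumption that decoding a
char* without a slice has to find the end of the data by looking for the
terminating NUL byte, which is how I read "if the encoding supports it".
The helper names `decode_slice` and `decode_until_nul` are hypothetical and
only mimic the behaviour on a bytes object:

```python
def decode_slice(buf, length, encoding="UTF-8"):
    """Decode exactly `length` bytes, like s[:length].decode(enc)."""
    return buf[:length].decode(encoding)


def decode_until_nul(buf, encoding="UTF-8"):
    """Decode up to the first NUL byte, as a strlen()-style decode would."""
    end = buf.find(b"\x00")
    if end < 0:
        end = len(buf)
    return buf[:end].decode(encoding)


# UTF-8 never puts a NUL byte inside encoded non-NUL text, so both forms agree.
utf8_buf = "gr\u00fc\u00df".encode("UTF-8") + b"\x00junk"

# UTF-16 does contain NUL bytes, so only the length-based form is safe.
utf16_buf = "hi".encode("UTF-16-LE") + b"\x00\x00"
```

With these buffers, `decode_slice(utf8_buf, 6)` and
`decode_until_nul(utf8_buf)` return the same text, while
`decode_until_nul(utf16_buf, "UTF-16-LE")` cuts the data after the first
byte and raises a UnicodeDecodeError.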


>> I'm +0.3 on the opposite way in Py2 for the 'str' type, though, as I
>> already mentioned. I think that would a) fit the intention of users,
>> b) match the main use case of accepting both str and unicode as function
>> arguments in Py2 (and only in Py2!), and c) be free of memory handling
>> issues as the target would still be a Python object.
>>
>> So I think it makes sense to support this only in Py2, and only for  
>> Python objects, not for char*.
> 
> So you're suggesting
> 
> def foo(str s):
>      char* ss = s
> 
> would work fine in Py2 on unicode input, but not for Py3?

1) Would you suggest that

    def foo(unicode s):
        cdef char* c_s = s

should accept str in Py2?

2) Your example above has the same issues for unicode input in Py3 as the
plain char* example below, so, yes, I'm suggesting that

    cdef char* s = some_unicode_string

cannot work without taking control of the required object allocation away
from users. Only auto-coding between Python strings can be automated in a
user-friendly way.
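
A minimal Python sketch of why coercion between Python strings is the easy
case: the result is an ordinary Python object that the caller holds a
reference to, so there is no hidden temporary whose lifetime matters. The
helper name `as_text` and its UTF-8 default are assumptions made for the
example, not an existing Cython feature:

```python
def as_text(value, encoding="UTF-8"):
    """Return `value` as a text string, decoding byte strings if needed."""
    if isinstance(value, str):
        return value  # already text, nothing to do
    if isinstance(value, (bytes, bytearray)):
        # The decoded result is a new Python object owned by the caller.
        return bytes(value).decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def greet(name):
    """Accept either string type as input, work on text internally."""
    return "Hello, " + as_text(name)
```

Here `greet(b"Cython")` and `greet("Cython")` return the same text. A char*
target offers no such owner for the intermediate object, which is where the
allocation problem comes from.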


> And this is not a shortcut for
> 
> def foo(char* ss):
>      ...
> 
> which would not be supported even if there was a directive in place to  
> use the system default encoding?

Correct. This is the same as

    cdef int* i = &(<int>some_python_int)

and similar things. I don't see a reason to special case char* here.

Stefan
