On Nov 28, 2009, at 7:01 AM, Stefan Behnel wrote:
> Robert Bradshaw, 28.11.2009 12:02:
>> On Nov 27, 2009, at 10:52 PM, Stefan Behnel wrote:
>>> On the downside, while being explicit, it can still lead to all sorts
>>> of unexpected behaviour for users because strings would pop up in
>>> non-obvious types in their code.
>>
>> I'm not following you here.
>
> Your main point is that people shouldn't be bothered with
> bytes/str/unicode issues if they just deal with 'text'. This means that
> the goal is to make it easy to forget that they exist.
Close. The goal is to only have to think about the issue once, not at
every single place char* is used.
> Having strings turn into unicode automatically will therefore easily
> lead to problems when dealing with byte strings, as it's just as easy to
> forget to decode a byte string into unicode as it is to forget to cast a
> byte string to <bytes> to make sure it does *not* become a unicode
> string. Depending on the code, this may or may not be an issue, but it's
> something we are about to introduce here.
Yep. If you're wrapping a library that returns byte strings for some
function calls and unicode strings for other function calls, you're
going to have to be careful either way. I guess I'm making the
assumption here that for most purposes the bytes object is not what
one wants to expose to the user. (Even Python 3 does not return bytes
objects when wrapping obvious char* values like reads from files or
sys.argv.)
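To make the point above concrete, here is a small Python 3 sketch (not from the original mail) showing that values which are char* at the C level, such as command-line arguments and text-mode file reads, reach user code as str rather than bytes:

```python
import os
import sys
import tempfile

# sys.argv entries are str in Python 3, decoded from the OS encoding,
# even though the underlying C values are char*.
assert all(isinstance(arg, str) for arg in sys.argv)

# Write a small file, then read it back both ways.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello")
    path = f.name

with open(path) as f:        # text mode -> str
    text = f.read()
with open(path, "rb") as f:  # only explicit binary mode exposes bytes
    raw = f.read()
os.remove(path)

print(type(text).__name__, type(raw).__name__)  # str bytes
```

The file names and contents here are illustrative; the point is only that bytes never appears unless the caller asks for it explicitly.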
>
>>> Now the conversion from char* to bytes would have to be explicit,
>>> although it's certainly not uncommon when dealing with C code, and
>>> totally normal in Py2.
>>
>> Yes, though only when the directive is in place.
>
> And then the problem is that users will not use that directive from the
> start, and will have to fix their byte strings when they realise that it
> exists.
>
> Plus, byte strings are a lot faster and more memory efficient in Py2
> than unicode strings. Enabling such a directive means that all
> char* -> string coercions will drop the string into unicode, unless
> users specifically cast them back.
But often byte strings are not what the end user wants, in which case
you're already converting to unicode everywhere, just doing it manually.
>
>>>> Cython could then transparently and efficiently handle all
>>>> char* <-> str (a.k.a. unicode) encodings in Py3, and
>>>> unicode -> char* in Py2.
>>> As Greg pointed out, going directly from unicode to char* isn't
>>> trivial to implement and the implications are certainly not obvious
>>> for most users and not controllable by user code, so you can't just
>>> free memory by setting a variable to None. I think that's straight out
>>> for not being explicit.
>>
>> We might have to limit ourselves to the system default encoding, as
>> there is a slot for that in the unicode object for just this purpose.
>
> ... which, just to repeat it, is deprecated and may thus go away without
> further warning. How would you handle this case then?
I'm not sure. It depends on whether it's just the idea of a "system
default encoding" that's deprecated, or if the slot containing an encoded
reference is going away.
>
>
>>> Currently, coercion from char*/bytes to unicode is an explicit step
>>> that is easy to do via
>>>
>>> cdef char* s = ...
>>> u = s[:length].decode('UTF-8')
>>>
>>> in 0.12. See
>>>
>>> http://trac.cython.org/cython_trac/ticket/436
>>
>> That is an improvement, though still a lot more baggage than
>>
>> cdef char* s = ...
>> u = s
>
> I just extended it to also speed up
>
> u = s.decode(enc)
>
> which users can deploy if the encoding supports it.
Thanks.
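A plain-Python sketch of the slicing idiom being discussed (the buffer contents are made up): a char* buffer often carries trailing data beyond the text payload, so the decode must be bounded by an explicit length.

```python
# Simulate a char* buffer where only the first `length` bytes are the
# payload; the rest is unrelated memory contents.
buf = b"hello world\x00\xff\xfejunk"
length = 11

# The Cython idiom  u = s[:length].decode('UTF-8')  maps directly to:
u = buf[:length].decode("UTF-8")
print(u)  # hello world
```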
> I think that's easy enough for going from bytes/char* to unicode, which
> I really think is worth being an explicit step.
>
>
>>> I'm +0.3 on the opposite way in Py2 for the 'str' type, though, as I
>>> already mentioned. I think that would a) fit the intention of users,
>>> b) match the main use case of accepting both str and unicode as
>>> function arguments in Py2 (and only in Py2!), and c) be free of memory
>>> handling issues as the target would still be a Python object.
>>>
>>> So I think it makes sense to support this only in Py2, and only for
>>> Python objects, not for char*.
>>
>> So you're suggesting
>>
>> def foo(str s):
>>     cdef char* ss = s
>>
>> would work fine in Py2 on unicode input, but not for Py3?
>
> 1) Would you suggest that
>
> def foo(unicode s):
>     cdef char* c_s = s
>
> should accept str in Py2?
No, I don't see a motivation to support this. Also, specifying unicode
rather than str is pretty explicit.
> 2) Your example above has the same issues for unicode input in Py3 as
> the plain char* example below, so, yes, I'm suggesting that
>
> cdef char* s = some_unicode_string
>
> does not work without preventing users from controlling the required
> object allocation. Only auto-coding between Python strings can be
> automated in a user-friendly way.
Unless we use the defenc slot.
>
>> And this is not a shortcut for
>>
>> def foo(char* ss):
>>     ...
>>
>> which would not be supported even if there was a directive in place to
>> use the system default encoding?
>
> Correct. This is the same as
>
> cdef int* i = &(<int>some_python_int)
>
> and similar things. I don't see a reason to special case char* here.
The motivation to special case it is that strings are such a
fundamental data type.
Thanks for the feedback,
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev