On Nov 28, 2009, at 7:01 AM, Stefan Behnel wrote:
> Robert Bradshaw, 28.11.2009 12:02:
>> On Nov 27, 2009, at 10:52 PM, Stefan Behnel wrote:
>>> On the downside, while being explicit, it can still lead to all sorts
>>> of unexpected behaviour for users because strings would pop up in
>>> non-obvious types in their code.
>>
>> I'm not following you here.
>
> Your main point is that people shouldn't be bothered with
> bytes/str/unicode issues if they just deal with 'text'. This means that
> the goal is to make it easy to forget that they exist.
Close. The goal is to only have to think about the issue once, not at
every single place char* is used.
> Having strings turn into unicode automatically will therefore easily
> lead to problems when dealing with byte strings, as it's just as easy to
> forget to decode a byte string into unicode as it is to forget to cast a
> byte string to <bytes> to make sure it does *not* become a unicode
> string. Depending on the code, this may or may not be an issue, but it's
> something we are about to introduce here.
Yep. If you're wrapping a library that returns byte strings for some
function calls and unicode strings for other function calls, you're
going to have to be careful either way. I guess I'm making the
assumption here that for most purposes the bytes object is not what
one wants to expose to the user. (Even Python 3 does not return bytes
objects when wrapping obvious char* values like reads from files or
sys.argv.)
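To make the point above concrete, here is a small Python 3 sketch (not from the original mail) showing that values which are char* at the C level, such as command-line arguments and text-mode file reads, reach user code as str rather than bytes:

```python
import os
import sys
import tempfile

# sys.argv entries are str in Python 3, decoded from the OS encoding,
# even though the underlying C values are char*.
assert all(isinstance(arg, str) for arg in sys.argv)

# Write a small file, then read it back both ways.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello")
    path = f.name

with open(path) as f:        # text mode -> str
    text = f.read()
with open(path, "rb") as f:  # only explicit binary mode exposes bytes
    raw = f.read()
os.remove(path)

print(type(text).__name__, type(raw).__name__)  # str bytes
```

The file names and contents here are illustrative; the point is only that bytes never appears unless the caller asks for it explicitly.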
>
>>> Now the conversion from char* to bytes would have to be explicit,
>>> although it's certainly not uncommon when dealing with C code, and
>>> totally normal in Py2.
>>
>> Yes, though only when the directive is in place.
>
> And then the problem is that users will not use that directive from the
> start, and will have to fix their byte strings when they realise that it
> exists.
>
> Plus, byte strings are a lot faster and more memory efficient in Py2
> than unicode strings. Enabling such a directive means that all
> char* -> string coercions will drop the string into unicode, unless
> users specifically cast them back.
But often byte strings are not what the end user wants, in which case
you're already converting to unicode everywhere, just doing it manually.
>
>>>> Cython could then transparently and efficiently handle all
>>>> char* <-> str (a.k.a. unicode) encodings in Py3, and
>>>> unicode -> char* in Py2.
>>> As Greg pointed out, going directly from unicode to char* isn't
>>> trivial to implement and the implications are certainly not obvious
>>> for most users and not controllable by user code, so you can't just
>>> free memory by setting a variable to None. I think that's straight out
>>> for not being explicit.
>>
>> We might have to limit ourselves to the system default encoding, as
>> there is a slot for that in the unicode object for just this purpose.
>
> ... which, just to repeat it, is deprecated and may thus go away without
> further warning. How would you handle this case then?
I'm not sure. It depends on whether it's just the idea of a "system
default encoding" that's deprecated, or if the slot containing an encoded
reference is going away.
>
>
>>> Currently, coercion from char*/bytes to unicode is an explicit step
>>> that is easy to do via
>>>
>>> cdef char* s = ...
>>> u = s[:length].decode('UTF-8')
>>>
>>> in 0.12. See
>>>
>>> http://trac.cython.org/cython_trac/ticket/436
>>
>> That is an improvement, though still a lot more baggage than
>>
>> cdef char* s = ...
>> u = s
>
> I just extended it to also speed up
>
> u = s.decode(enc)
>
> which users can deploy if the encoding supports it.
Thanks.
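A plain-Python sketch of the slicing idiom being discussed (the buffer contents are made up): a char* buffer often carries trailing data beyond the text payload, so the decode must be bounded by an explicit length.

```python
# Simulate a char* buffer where only the first `length` bytes are the
# payload; the rest is unrelated memory contents.
buf = b"hello world\x00\xff\xfejunk"
length = 11

# The Cython idiom  u = s[:length].decode('UTF-8')  maps directly to:
u = buf[:length].decode("UTF-8")
print(u)  # hello world
```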
> I think that's easy enough for going from bytes/char* to unicode, which
> I really think is worth being an explicit step.
>
>
>>> I'm +0.3 on the opposite way in Py2 for the 'str' type, though, as I
>>> already mentioned. I think that would a) fit the intention of users,
>>> b) match the main use case of accepting both str and unicode as
>>> function arguments in Py2 (and only in Py2!), and c) be free of memory
>>> handling issues as the target would still be a Python object.
>>>
>>> So I think it makes sense to support this only in Py2, and only for
>>> Python objects, not for char*.
>>
>> So you're suggesting
>>
>> def foo(str s):
>>     cdef char* ss = s
>>
>> would work fine in Py2 on unicode input, but not for Py3?
>
> 1) Would you suggest that
>
> def foo(unicode s):
>     cdef char* c_s = s
>
> should accept str in Py2?
No, I don't see a motivation to support this. Also, specifying unicode
rather than str is pretty explicit.
> 2) Your example above has the same issues for unicode input in Py3 as
> the plain char* example below, so, yes, I'm suggesting that
>
> cdef char* s = some_unicode_string
>
> does not work without preventing users from controlling the required
> object allocation. Only auto-coding between Python strings can be
> automated in a user-friendly way.
Unless we use the defenc slot.
>
>> And this is not a shortcut for
>>
>> def foo(char* ss):
>>     ...
>>
>> which would not be supported even if there was a directive in place to
>> use the system default encoding?
>
> Correct. This is the same as
>
> cdef int* i = &(<int>some_python_int)
>
> and similar things. I don't see a reason to special case char* here.
The motivation to special case it is that strings are such a
fundamental data type.
Thanks for the feedback,
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev