Re: [Cython] String types with Python 2.x and 3.x

Robert Bradshaw Mon, 14 Sep 2009 18:57:05 -0700

On Sep 14, 2009, at 12:05 PM, Stefan Behnel wrote:

> Robert Bradshaw wrote:
>> On Sep 13, 2009, at 12:39 PM, Stefan Behnel wrote:
>>>>>        cdef str s = "some string"
>>>>>        cdef char* cs = s
>>>>>
>>>> I'm inclined for a warning... and that warning would not be generated
>>>> in this case: "cdef char* cs = <bytes>s", right?
>>> Sure.
>>
>> That could be bad, <bytes>s doesn't actually do a typecheck,
>> especially if the bytes -> char* is eventually optimized. One should
>> do <bytes?>s or <object>s (neither of which generate a warning).
>
> To me, that's just like casting an int to a void*. I don't see a reason to
> special case some casts while we already allow all that dangerous C stuff.
> If nothing else, a cast is a clear way to say "I know better!". And if you
> actually do not know better, you'll see where that gets you. Not Cython's
> problem.


Yes. As I said, my point was just that we shouldn't encourage *this*
solution, as it doesn't do type checking.
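
As a rough illustration of the distinction, here is a plain-Python analogy
(not Cython code, and the helper names are made up for this sketch).
typing.cast stands in for the unchecked <bytes>s cast, since it performs no
runtime check, and an explicit isinstance test stands in for the checked
<bytes?>s cast:

```python
from typing import cast


def unchecked_cast(value):
    # Loosely analogous to Cython's <bytes>value: typing.cast only informs
    # static checkers and returns the value unchanged at runtime.
    return cast(bytes, value)


def checked_cast(value):
    # Loosely analogous to Cython's <bytes?>value: refuse anything that is
    # not actually a bytes object.
    if not isinstance(value, bytes):
        raise TypeError("expected bytes, got %s" % type(value).__name__)
    return value


s = "some string"

# The unchecked version happily hands back a str that is labelled as bytes.
wrong = unchecked_cast(s)

# The checked version catches the mistake at the point of the cast.
try:
    checked_cast(s)
    checked_failed = False
except TypeError:
    checked_failed = True
```

In this sketch the unchecked cast lets the wrong type through silently, which
is the reason for not recommending it as the way to silence the warning.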

>>> changing the argument/return value types from "object" to the
>>> right types will allow Cython to do actual type checking.
>>
>> Often the type checking will be redundant with the type checking that
>> happens inside the method, so I'm not so sure this is a good idea.
>
> I meant compile time type checking, which won't hurt performance but helps
> in making the C-API safer and also allows Cython to do some optimisations.

Sometimes. For example, PyUnicode_GetSize in principle takes a unicode
object, but is only typed to take an object. It performs its own
typecheck, so we should just define it as taking an object and not do
the redundant type check ourselves.

> For example, I only noticed recently that literal Python strings were
> always treated as "object" in Cython. So things like u"".join() were never
> associated with the unicode type.

Yes, if u"" is typed, we should be able to optimize on it.

>>>>> And "str", "bytes" and "unicode" wouldn't be assignable to each
>>>>> other,
>>>>> right? Or would you also leave that to runtime?
>>>> "bytes" <-> "unicode" (obviously?) would not be assignable, though for
>>>> the case of "bytes" <-> "str" or "str" <-> "unicode", we could
>>>> generate similar Cython compile warnings as for the "[unsigned] char *"
>>>> conversions.
>>> Yes, I guess that's a similar case.
>>
>> I'd be inclined to outright disallow them, favoring requiring a <bytes?>
>> or <unicode?> or <object> cast.
>
> Perfectly fine with me.
>
>
>> Currently, though, I can't think
>> of any reason to type str/bytes/unicode variables at all.
>
> You should take a look at the call optimisations for builtin types. I've
> been adding to them for a while now, and they really make a huge difference.
>
> For example, this:
>
>       cdef unicode u = some_unicode_string
>       s = u.encode('UTF-8')
>
> will now result in a straight C call to the UTF-8 encoder, instead of
> looking up the method, calling it, and having it look up the codec
> internally. I find that pretty cool.

Hmm, not for me (at least not in the -devel branch), but I could see  
this being very nice.
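
The idea behind the optimisation Stefan describes can be sketched in plain
Python (an analogy only: the Cython version would emit a C-level call, which
this snippet does not attempt to show). The method call resolves the codec by
name, while codecs.utf_8_encode is the UTF-8 encoder function itself and can
be called directly once the encoding is known up front:

```python
import codecs

u = "gr\u00fc\u00df dich"

# Generic path: look up the method, which in turn resolves the codec by name.
via_method = u.encode("UTF-8")

# Direct path: call the UTF-8 encoder function itself. It returns a tuple of
# (encoded bytes, number of characters consumed).
direct, consumed = codecs.utf_8_encode(u)
```

Both paths produce the same bytes; the difference is only in how much lookup
work happens before the encoder runs.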

- Robert

