Re: [Cython] first lessons learned while porting lxml to Py3

Stefan Behnel Tue, 20 May 2008 00:19:58 -0700

Hi Greg,

Greg Ewing wrote:
> I'm convinced that unrestricted automatic conversion between
> char * and unicode would be a bad idea. I'm not yet totally
> convinced that Pyrex shouldn't allow it under certain
> conditions, such as the string containing only ascii code
> points (checked at run time).


That would be the way Py2 behaves.
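To make the run-time check concrete, here is a minimal sketch in plain Python 3 of an ASCII-only conversion. The function name is hypothetical, and this only illustrates the idea; it is not what Pyrex or Cython actually generate.

```python
def checked_ascii_to_unicode(data):
    """Convert a byte string to text, but only if it is pure ASCII.

    Hypothetical helper: mimics the run-time check discussed above.
    The strict 'ascii' codec raises UnicodeDecodeError on any byte >= 0x80.
    """
    return data.decode("ascii")


# Pure ASCII input converts without complaint.
text = checked_ascii_to_unicode(b"abc")

# Non-ASCII input (here: UTF-8 encoded 'é') is rejected at run time.
try:
    checked_ascii_to_unicode(b"caf\xc3\xa9")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

The point of the check is that the conversion either succeeds unambiguously or fails loudly, rather than guessing an encoding.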


> For Pyrex, I'm also thinking about not trying to make the
> language match py3 at all, at least not in every way. For
> example, I may decide to keep the 'u' prefix for Python
> unicode literals.

I agree that this would not hurt much. Cython currently allows it.


> This probably isn't the right thing for Cython to do if it
> wants to be a pure-Python compiler, but Pyrex has a different
> goal -- it's meant to be a half-way house between Python
> and C.

Cython has the same goal, meaning that it tries to simplify the work
between Python code and C code. But additionally, it wants to support as
much of the Python language itself as possible, to lower the entry level
for Python programmers, who are the main target audience after all.

I think the targeted work-flow will always be: write it in Python, add the
C calls to connect to external libraries, optimise by adding type
declarations to the Python code. And one of the main goals of Cython is to
reduce the need for the last step as much as possible. I think that's what
will eventually make it a "pure Python compiler".


> Currently in Pyrex, "xxx" is not a Python type at all --
> it's a C type (i.e. char *). It only becomes a Python type
> when used in a Python context, forcing conversion to a
> Python string object.
>
> I don't think it's necessarily wrong to keep it that way,
> i.e. "xxx" is a C string, and if you want a Python string
> object as a literal, you have to say which kind you want
> with a "b" or "u" prefix.

Makes sense to me, although

    cdef char* s = b"..."

would still be possible and done at compile time, so it's not quite as
simple.


> That way, the Pyrex language itself can stay much the same,
> and you just have to write code that takes care to accept
> unicode strings if you intend to use it in a py3 environment.

I would say: regardless of the environment. Not checking string input is a
bug IMHO.
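As an illustration of what "checking string input" could mean, here is a small plain-Python sketch. The helper name is made up, and passing byte strings through unchanged is an assumption; real code might also want to validate their encoding.

```python
def to_utf8_bytes(value):
    """Normalise string input to UTF-8 encoded bytes.

    Hypothetical helper: accepts byte strings and unicode strings,
    and rejects everything else explicitly.
    """
    if isinstance(value, bytes):
        # Assumption: byte strings are taken as already encoded.
        return value
    if isinstance(value, str):
        return value.encode("utf-8")
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


from_text = to_utf8_bytes("tag")
from_bytes = to_utf8_bytes(b"tag")
```

Anything that is neither kind of string, such as an integer, raises TypeError instead of being passed on to C code.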


>> * A new C pseudo-type has to be added, let's call it 'uchar' (a better
>> name would be needed, as it can be confused with unsigned char). Then
>> something like 'cdef uchar *p = obj' will only accept a unicode
>> string
>
> What would it actually point to -- utf8 encoded chars?

I guess so.


> How would it interact with char *?

Good question. :)

That raises the question what a uchar* is good for if you can just assign
it to a char* variable. Then you'd have to use this quirk to assign a
unicode string as UTF-8 encoded data to a char*:

    cdef char* s = <uchar*>u_str

Vicious! :)
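For comparison, the cast above would hide what is really an explicit encoding step. A plain-Python sketch of that step, with an example value chosen here for illustration:

```python
# Example value (an assumption for illustration): one non-ASCII character.
u_str = "caf\u00e9"

# The explicit spelling of what the hypothetical <uchar*> cast would do:
# encode the unicode string as UTF-8 and hand the resulting bytes to C.
encoded = u_str.encode("utf-8")

# The char* would see bytes, not code points, so the lengths differ.
code_points = len(u_str)
byte_count = len(encoded)
```

In Cython itself, the encoded bytes object would also have to stay alive for as long as the char* pointing into it is in use.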

Stefan

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
