Re: [Cython] first lessons learned while porting lxml to Py3

Stefan Behnel Mon, 19 May 2008 09:56:31 -0700

Hi Lisandro,

Lisandro Dalcin wrote:
> After all your comments, my conclusion is the following:
>
> - Any code dealing with string processing should be written in a way
> where byte and unicode strings have to be prefixed like b"abc"
> and u"abc".


I actually like the way it is in Py3. Unicode is the right thing most of
the time - except when you deal with C-APIs as in Cython, where the best
place to handle unicode is right below the API level, and nowhere else in
your code. :)


> Guido decided that Python 3 would not
> support the u"abc" form. And now, from a user/developer perspective,
> although this contradicts the "only one way...", I'm not sure at all if
> that was a good idea in practice.

It makes writing portable code very hard, just think of code that must
support Python 2.3-3.0. I'm currently wrapping all string literals in the
test cases in lxml with a function call _bytes() or _str(), which then
does the right thing depending on the runtime environment. But it's a
whole bunch of work to manually put this all over the place...


> - I still think that unprefixed forms should match the builtin 'str'
> type in Py2 and Py3. This way, things like docstrings, exception
> messages, and calls like getattr([], "append") will just work.

That's the problem: those are simple. Simple to find and simple to change,
even with a script. All other places where data handling is involved are
actually likely to break if we make it a general switch.

I could agree on automatic promotion of docstrings and maybe even
exception messages to unicode strings, but such a selective automatism
would be somewhat surprising to users. And I'm a big fan of "explicit is
better than implicit".

Actually, shipping Cython with a simple script that prefixes all
docstrings and "raise" messages with a 'u' would get us a lot of relief
here. Maybe someone could write such a beast?
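
A rough sketch of the docstring half of such a script, assuming it is run
as a separate tool under a current Python 3 interpreter and that the
sources tokenize as Python. The function name prefix_docstrings() is made
up for this example, and "raise" messages are not handled here:

```python
import io
import tokenize

# Token types that can directly precede a statement-level string literal.
_STATEMENT_START = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def prefix_docstrings(source):
    """Add a 'u' prefix to unprefixed strings that form a whole statement."""
    tokens = [
        tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type not in (tokenize.COMMENT, tokenize.NL)
    ]
    positions = []
    for i, tok in enumerate(tokens):
        if tok.type != tokenize.STRING or tok.string[0] not in "'\"":
            continue  # not a string, or it already carries a prefix
        starts_statement = i == 0 or tokens[i - 1].type in _STATEMENT_START
        ends_statement = tokens[i + 1].type == tokenize.NEWLINE
        if starts_statement and ends_statement:
            positions.append(tok.start)

    lines = source.splitlines(True)
    # Insert from the end so that earlier column offsets stay valid.
    for row, col in sorted(positions, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "u" + line[col:]
    return "".join(lines)
```

It treats every string that stands alone as a statement as a docstring,
which is slightly broader than the language definition, and it misses
docstrings written on the same line as the "def".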


> And of
> course, casting to a raw 'char*' pointer should only accept bytes (by
> using PyString_AsString). This is going to be safe in Py3, but not in
> Py2.

Even in my current implementation, the semantics are not entirely clean
here. There are still cases where an explicit string literal gets coerced
to the other type, in the DEF statement, for example.


> But Py2 is broken anyway, right?

It's still an important platform, though. Much more important than Py3. :)


> - Add a '-py3' command line flag to Cython (this will be needed in the
> future anyway, right?).

I've always seen that as a way to handle plain Python 3 source code, not
Py3-esque Cython code. I think the latter would be better served with one
or more explicit __future__ imports. Configuring source semantics outside
of the source is something that is hard to keep track of and that can get
difficult to manage when you combine code from different sources - which
is not so rare as it might seem.


> When this flag is active, then the following will happen at runtime:
>
> * Something like 'cdef char *p = obj' will only accept a byte string
> ('str' type in Py2, 'bytes' in Py3). If 'obj' is a unicode string,
> the generated code raises a TypeError both on a Py2 and a Py3 runtime.

Right, this actually currently works (sort-of) in Py2:

    cdef char* val
    uval = u"abc"
    val = uval
    print repr(val)

prints 'abc' in Py2 and raises a TypeError in Py3. If you use non-ASCII
letters, however, this fails with a UnicodeDecodeError in Py2. It would
really be better if Cython caught that for the literal case and raised at
least a runtime TypeError in the case above. And I mean always, not just
with a command line switch, as this will really help users by showing them
where work has to be done.


> * A new C pseudo-type has to be added, let's call it 'uchar' (a better
> name would be needed, as it can be confused with unsigned char). Then
> something like 'cdef uchar *p = obj' will only accept a unicode
> string ('unicode' type in Py2 and 'str' in Py3). If 'obj' is a byte
> string, the generated code raises TypeError both on a Py2 and a Py3
> runtime.

I assume you mean a conversion to UTF-8 here, in which case "utf8char"
would be appropriate IMHO. Still, I find

     s.encode("UTF-8")

so short and explicit that I don't see a major need for a special type
name here. And in many, many cases, you will even be able to say

   def dostuff(text):
     cdef char* c_s
     text = text.encode("UTF-8")
     c_s = text
     ...

so you don't even need to care about GC or anything, as "text" will stay
alive during the function call.

Regarding the TypeError, this would do the trick:

   def dostuff(unicode text):
       ...
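
For comparison, the same check-then-encode pattern in plain Python. This
is only a sketch: the isinstance() test stands in for the typed argument,
and the return value is added here so that the result can be inspected:

```python
def dostuff(text):
    # Stand-in for the 'unicode text' argument declaration: reject
    # anything that is not a unicode string ('unicode' in Py2, 'str' in Py3).
    if not isinstance(text, type(u"")):
        raise TypeError(
            "expected a unicode string, got %s" % type(text).__name__)
    # Explicit conversion to a UTF-8 encoded byte string.
    data = text.encode("UTF-8")
    return data
```

Passing a byte string raises the TypeError up front instead of failing
somewhere further down in the encoding step.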


> This proposal just tries to make Cython/Pyrex
> stricter, more explicit,  even in a Py2 runtime.

I for one would appreciate such strict semantics even without a command
line switch. lxml has grown over a couple of years. If Cython had told me
that some things don't work that way and will stop working in Py3, I
wouldn't have to walk through the hassle of migrating my code now.


> Stefan, please do not get angry if all this is nonsense.

I rarely bite. ;) And it's not nonsense at all.


> All this stuff is going to be real pain to all Python users

Yep, this really is work, especially if you do not have a handy (though
not even reliable) 2to3 tool at hand.


> About a week ago, I asked Fernando Perez and Brian Granger for
> suggestions on how they think I should handle the byte/unicode stuff
> in mpi4py. I paste Fernando's answer below:
>
> """
> Wow, this one's going to be a *huge* thorn in everyone's side, if I
> understand the problem correctly.  What has been the policy of other
> projects?
> """

Regarding a policy, I have decided to get lxml's Cython code clean and the
Python code portable without 2to3. That's more work than trying to find
ways to cheat, but it's the right thing to do, and the safest option.


> And I believe this is the situation of many, many Python users and
> developers out there, even of the very smart and productive ones like
> Fernando. If Cython takes a good direction on all this, then this will
> be beneficial for the Cython project itself, but also for other people
> and other projects to follow the right path.

I wouldn't mind finding ways to make the migration easier for users. I'm
just very, very reluctant to make changes that can end up breaking correct
code or that keep people from thinking about the implications of the code
they write. For example, automatic conversion of plain ASCII byte strings
to unicode strings has great potential for encouraging people to write
code that breaks the first time someone passes a non-ASCII string in.
Getting code right at the time of its writing is very important to keep
the maintenance overhead low, and the tools should help here instead of
hiding potential bugs.
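
To illustrate the failure mode, here is a small sketch in which the
function auto_promote() is a hypothetical stand-in for such an automatic
ASCII-only conversion, not something Cython provides:

```python
def auto_promote(data):
    # Hypothetical automatic conversion: silently treats bytes as ASCII text.
    return data.decode("ascii")


# Works as long as every caller happens to pass ASCII-only data.
print(auto_promote(b"abc"))

try:
    # The first non-ASCII input (here UTF-8 encoded) exposes the problem.
    auto_promote(u"\xe4".encode("UTF-8"))
except UnicodeDecodeError as exc:
    print("breaks: %s" % exc)
```

The code looks correct in testing with ASCII data and only fails once
real-world input arrives.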

Stefan

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
