Re: [Cython] unicode can bite us ...

Stefan Behnel Tue, 24 Feb 2009 01:34:54 -0800

Robert Bradshaw wrote:
> On Feb 24, 2009, at 12:29 AM, Stefan Behnel wrote:
>> Well, at least, that's what's written in the code: a byte string.
>> What I'm saying is that /requiring/ a byte string at the interface
>> level is wrong.
>
> I agree on this point. I'm not as convinced that accepting a byte
> string is wrong though.


It's not wrong to /accept/ one. But it's wrong to have your code fail when
someone passes a unicode string.

I normally vote for requiring unicode strings in APIs under Py3, but
that's something that needs to be decided for each case separately.


>>> I think you underestimate how long broken libraries will be out
>>> there.
>>
>> Let's wait and see. It didn't take me very long to fix up the Py3
>> unicode problems of lxml's API (those that were independent of
>> Cython), so I would expect that any library can be fixed in a
>> couple of weeks
>
> I don't doubt most libraries could be made Py3 unicode compliant if
> someone were willing to spend "a couple of weeks" fixing it

I was actually speaking in terms of spare-time weeks rather than full-time
weeks.

It's all about fixing APIs. What I'd advocate (for Cython code at least)
is to pass all API string input through a helper function that does the
right thing, and to do the same for string output. That gives you a single
place for fixing things, and it only needs to be done once. Then there are
evil things like file name handling, but that's about it.
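
A minimal sketch of such a pair of helpers, written as plain Python 3 so
that it runs as is. The names encode_input() and decode_output() are
hypothetical, and the choice of UTF-8 as the internal encoding is an
assumption; a real library would pick whatever its C side expects.

```python
def encode_input(s, encoding="utf-8"):
    """Normalise API string input to bytes (hypothetical helper).

    Accepts both byte strings and unicode strings, so callers never
    fail just because they passed the "other" string type.
    """
    if isinstance(s, bytes):
        return s
    if isinstance(s, str):
        return s.encode(encoding)
    raise TypeError("expected a string, got %s" % type(s).__name__)


def decode_output(b, encoding="utf-8"):
    """Turn bytes coming back from the C side into a unicode string."""
    return b.decode(encoding)
```

Every public function would then call encode_input() on its string
arguments and decode_output() on its string results, which keeps the
policy in one place.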

Even ParseArgs() and friends help you by accepting unicode strings for a
char* ("s"), as long as the ASCII codec can encode them. That is
definitely the case for NumPy arguments, for example.
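
To illustrate the behaviour described above, here is a pure-Python
emulation. It is only a sketch: the real conversion happens in C inside
the argument parsing functions, and the name parse_s_argument() is made
up for this example.

```python
def parse_s_argument(value):
    """Emulate accepting a string for a char* argument (hypothetical).

    Byte strings pass through unchanged. Unicode strings are accepted
    only if they are pure ASCII; anything else raises UnicodeEncodeError.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode("ascii")
    raise TypeError("expected a string, got %s" % type(value).__name__)
```

An argument like "float64" goes through either way, while a string
containing non-ASCII characters is rejected instead of being silently
mangled.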


> I think
> it's more a question of motivation. Unicode support is something you
> care a lot about (and I'm glad you do, it's thanks to you we support
> it so well in Cython) and is also a very natural and important issue
> to deal with for an xml parser. People writing scientific libraries
> (for example) are probably more worried about endianness issues and
> fortran compatibility than unicode support, though for open source
> projects hopefully someone steps up and does it.

I think that Py3 makes programmers a lot more aware of these issues, and
provides a growing motivation for them to fix their code. It's not
uncommon for release announcements these days to contain at least a
comment on support for Py3, if not a success story. That really makes me
confident that these problems will go away sooner rather than later.

In the specific case of NumPy, there's also the IronClad project that
attempts a port to IronPython. They'll come up with their own set of fixes
anyway. I wouldn't be surprised if some of them were related to unicode
handling.


> We don't need a new syntax.
>
> def foo():
>      return "Something."
>
> should return a str object: bytes under Py2, and unicode under Py3.

:) didn't we have this discussion already?

What if you wanted to pass the result of that function into C code?
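
To make the problem concrete, here is a small Python 3 sketch that uses
ctypes as a stand-in for "C code". The helper as_c_char_p() and the
UTF-8 default are assumptions for illustration, not something Cython
does for you.

```python
import ctypes


def foo():
    return "Something."   # a unicode str under Py3


def as_c_char_p(text, encoding="utf-8"):
    """Explicitly encode a unicode string before handing it to C
    (hypothetical helper)."""
    return ctypes.c_char_p(text.encode(encoding))


# Passing the str directly is rejected: a char* needs bytes.
try:
    ctypes.c_char_p(foo())
    passed_directly = True
except TypeError:
    passed_directly = False
```

The point is that somebody has to choose an encoding at the boundary;
the str result cannot be handed over as a char* unchanged.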

Stefan
