Hi,

Greg Ewing wrote:
> Keeping unicode and bytes clearly separated makes good
> sense in py3, because you're in a high-level world that's
> firmly isolated from the outside. It's a viable strategy
> to convert all your data to unicode as soon as it comes
> in, and not have to worry about the issue otherwise.

lxml works exactly the other way round. All unicode strings that come in are
converted to UTF-8 before doing anything else with them.
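
As a rough sketch of that idea (the helper name to_utf8 is made up for
illustration and is not lxml's actual API; it also assumes that byte strings
passed in are already UTF-8 encoded):

```python
def to_utf8(value):
    """Return value as UTF-8 bytes (hypothetical helper, not lxml's API)."""
    if isinstance(value, bytes):
        # Assumption: incoming byte strings are already UTF-8 encoded.
        return value
    if isinstance(value, str):
        # Encode unicode input once, right where it enters the module.
        return value.encode("utf-8")
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

Everything behind such a helper only ever sees bytes, which is what the C
side wants anyway.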


> But the inside of a Pyrex module isn't such an isolated
> environment. At every turn, you're dealing with C code
> that doesn't make such a clear distinction between bytes
> and unicode.

That C code usually won't know anything about Unicode anyway, at least not
about Python unicode strings. C code usually cares about bytes, in which case
the inverse of the approach you sketched above is exactly the right thing to
do: encode unicode input to bytes once, as soon as it comes in.


> I'm not sure that trying to maintain the
> distinction rigidly for Python data, when there is all
> this C data around that doesn't maintain any such
> distinction, is worth the effort.

I think you will always have to find some kind of lingua franca that is used
throughout your program. You may decide to convert all data coming from C
directly into a Python unicode string and work with that, or it may be more
suitable to convert all Python unicode input into bytes and work with those.
But it's just bad design to keep converting back and forth all over the place,
so I still think we are discussing a mixture of a non-issue and a potential
source of bugs here.
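
To illustrate the "bytes as lingua franca" variant, here is a minimal sketch.
The class TagStore and its methods are hypothetical names invented for this
example; the point is only that conversion happens at the edges and nowhere
in between:

```python
class TagStore:
    """Hypothetical container that keeps UTF-8 bytes internally."""

    def __init__(self):
        self._tags = []  # invariant: every item is a bytes object

    @staticmethod
    def _encode(tag):
        # Single conversion point for everything coming in from Python code.
        return tag.encode("utf-8") if isinstance(tag, str) else tag

    def add(self, tag):
        self._tags.append(self._encode(tag))

    def contains(self, tag):
        # Comparison happens on bytes only, no decoding involved.
        return self._encode(tag) in self._tags

    def names(self):
        # Single conversion point for everything going back out.
        return [tag.decode("utf-8") for tag in self._tags]
```

The opposite design (decode everything coming from C into unicode on entry)
would look the same with the two conversion points swapped. Either works, as
long as you pick one and stick with it.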

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev