Hi,

Greg Ewing wrote:
> Keeping unicode and bytes clearly separated makes good
> sense in py3, because you're in a high-level world that's
> firmly isolated from the outside. It's a viable strategy
> to convert all your data to unicode as soon as it comes
> in, and not have to worry about the issue otherwise.

lxml works exactly the other way round. All unicode strings that come in are
converted to UTF-8 before doing anything else with them.

> But the inside of a Pyrex module isn't such an isolated
> environment. At every turn, you're dealing with C code
> that doesn't make such a clear distinction between bytes
> and unicode.

It usually won't know anything about Unicode anyway, at least not about Python
unicode strings. C code usually cares about bytes, in which case the inverse
of the approach you sketched above is exactly the right thing to do.

> I'm not sure that trying to maintain the
> distinction rigidly for Python data, when there is all
> this C data around that doesn't maintain any such
> distinction, is worth the effort.

I think you will always have to find some kind of lingua franca that is used
throughout your program. You may decide to convert all data coming from C
directly into a Python unicode string and work with that, or it may be more
suitable to convert all Python unicode input into bytes and work with those.
But it's just bad design to keep converting back and forth all over the place,
so I still think we are discussing a mixture of a non-issue and a potential
source of bugs here.

Stefan
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
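
To make the "one lingua franca" point concrete, here is a minimal sketch in
plain Python 3 (not Pyrex/Cython) of the bytes-inside approach: text is
encoded to UTF-8 once on the way in, all internal work happens on bytes, and
the result is decoded once on the way out. The names to_utf8, from_utf8 and
local_name are made up for illustration and are not lxml's API; the function
body only stands in for whatever bytes-oriented C code sits behind the
boundary.

```python
# Hypothetical helpers for illustration only; this is not lxml's actual API.

def to_utf8(value):
    """Normalise input to UTF-8 bytes once, at the module boundary."""
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, bytes):
        return value  # assumption: byte input is already UTF-8 (or ASCII)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


def from_utf8(data):
    """Decode UTF-8 bytes back to text once, on the way out."""
    return data.decode("utf-8")


def local_name(tag):
    """Stand-in for the bytes-only side: drop a leading '{namespace}' prefix."""
    data = to_utf8(tag)                     # single conversion on the way in
    if data.startswith(b"{"):
        _, sep, rest = data.partition(b"}")
        if sep:                             # only strip a properly closed prefix
            data = rest                     # all internal work is done on bytes
    return from_utf8(data)                  # single conversion on the way out
```

The inverse design (decode everything coming from C into unicode right away)
would look the same with the two helpers swapped; the point is only that the
conversion happens in one place rather than all over the code.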
