Hi,

Dominic Sacré wrote:
> I'm trying to make a Pyrex/Cython module that was originally written for
> Python 2.x work with Python 3.x, while at the same time keeping it
> compatible with older versions.
>
> It seems like when using Python 3.x, Cython will automatically replace
> 'unicode' with 'str', and 'str' with 'bytes'. Also, string literals are
> interpreted as 'bytes' unless prefixed with 'u'.

Correct.

> However, 'bytes' is not really useful in a context where an actual
> string is expected

You mean "text", I suppose? "string" is ambiguous as it can refer to C
strings, Python byte strings and Python Unicode strings.

> and causes problems for example when working with
> strings passed from Python.
> (One of many issues I have run into is the fact that b"foo" != "foo"...)

Yep, and that's a really good thing. I fixed loads of those in Cython
lately, and tons of them in the test suite.

> The only solution I've found to at least get most of my code working is
> basically to use unicode for almost everything

That's the way to go anyway. To make the code Unicode aware, you have to
make it distinguish between text, encoded text and data.

> but if possible I'd like to avoid unicode strings in the 2.x version.

That's not impossible, but it certainly is some work and the benefit is
rather questionable, as it can easily bite you if you do not take care
about the three-fold separation above.

I do this in lxml as the API dictates that under Py2, ASCII compatible
byte strings are accepted and returned as ASCII encoded byte strings. I
actually work completely with UTF-8 encoded strings inside of lxml and
use dedicated functions for checking and encoding everything that comes
through the API or that goes back to the user.

The main theme is to decide if you want to work with unicode internally
or with encoded byte strings. Choose one or the other, not both. And make
sure you check byte strings that contain text on the way in and reject
them in the face of encoding ambiguity.

In any case, data byte strings should remain unchanged, although you may
run into all sorts of problems with file names (which are really text but
that won't necessarily help you when trying to find them in an encoded
file system, or when a user passes you an encoded URL that came from
whatever source).

> Is there a sane way to use the native string type (i.e. 'str') in either
> Python version?

...
and have Cython automatically encode and decode the byte strings for you?

No, certainly not. Encoding is an explicit operation and it will make your
code safer to make it explicit.

Stefan
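
P.S. To illustrate the "check on the way in, convert on the way out" idea
from above, here is a minimal sketch in plain Python. It assumes you pick
UTF-8 encoded byte strings as the internal representation. The names
to_internal() and from_internal() are made up for this example, and this
is not lxml's actual code, just the general shape of such helpers.

```python
import sys

PY3 = sys.version_info[0] >= 3
# 'unicode' only exists on Python 2; the conditional never evaluates it on Py3.
text_type = str if PY3 else unicode  # noqa: F821


def to_internal(value):
    """Hypothetical helper: convert API input to internal UTF-8 bytes."""
    if isinstance(value, text_type):
        # Text is unambiguous, so encoding it is always safe.
        return value.encode("utf-8")
    if isinstance(value, bytes):
        # A byte string that claims to be text is only accepted if it is
        # plain ASCII; anything else has an unknown encoding.
        try:
            value.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError("non-ASCII byte string: pass text instead")
        return value
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)


def from_internal(data):
    """Hypothetical helper: convert internal UTF-8 bytes back to text."""
    return data.decode("utf-8")
```

With helpers like these, to_internal(u"caf\xe9") gives the UTF-8 bytes,
to_internal(b"abc") passes through unchanged, and to_internal(b"caf\xe9")
is rejected because its encoding cannot be known. Whether from_internal()
should hand back text or ASCII byte strings under Py2 is an API decision
that this sketch leaves open.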
