Hi, I've been working on a fix for ticket #602, negative indexing for inferred char*.
http://trac.cython.org/cython_trac/ticket/602

Currently, when you write this:

    s = b'abc'

s is inferred as char*. This has several drawbacks. For one, we lose the length information, so "len(s)" becomes O(n) instead of O(1). Negative indexing fails completely because it uses pointer arithmetic and thus leaves the allocated memory area of the string. Also, code like the following is extremely inefficient because it requires multiple conversions from a char* of unknown length to a Python bytes object:

    s = b'abc'
    a = s1 + s
    b = s2 + s

I came to the conclusion that the right fix is to stop letting byte string literals start off as char*. This immediately fixes these issues and improves Python compatibility while still allowing automatic coercion, but it also comes with its own drawbacks.

In nogil blocks, you will have to explicitly declare a variable as char* when assigning a byte string literal to it, otherwise you'd get a compile time error for a Python object assignment. I think this is a minor issue, as most users would declare their variables anyway when using nogil blocks. Given that there isn't much you can do with a Python string inside a nogil block, we could also honour nogil blocks during type inference and automatically infer char* for literals there. I don't think it would hurt anyone to do that.

The second drawback is that it impacts type inference for char loops. Previously, you could write

    s = b'abc'
    for c in s:
        print c

and Cython would infer 'char' for c and print integer byte values. When s is inferred as 'bytes', c will be inferred as 'Python object' because Python 2 returns 1-byte strings and Python 3 returns integers on iteration. Thus the loop will run entirely in Python code and return different things in Py2 and Py3. I do not expect that this is a major issue either.
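To make the 'bytes' side of this concrete, here is a small sketch in plain Python 3 (not Cython, and not taken from the ticket) of how a real bytes object behaves for the three operations discussed above. The variable names are only illustrative; the assumption is that a literal inferred as 'bytes' would follow these Python semantics.

```python
# Plain Python 3 sketch (assumption: an inferred 'bytes' literal behaves
# like an ordinary Python bytes object).
s = b'abc'

# The length is stored on the object, so no scan for a NUL terminator is needed.
length = len(s)

# Negative indices are resolved against the known length, so they stay in bounds.
last = s[-1]

# Iteration yields integers in Python 3; Python 2 would yield 1-byte strings.
items = [c for c in s]

print(length, last, items)  # prints: 3 99 [97, 98, 99]
```

Under Python 2 the same loop would produce 'a', 'b', 'c' instead of integers, which is the Py2/Py3 difference mentioned above.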
Iteration over literals should be rare, after all, and if the byte string is constructed in any way, the type either becomes a bytes object through Python operations (like concatenation) or is explicitly provided, e.g. as a return type of a function call. But it is a clear behavioural change for the type inference in an area where Cython's (and Python's) semantics are tricky anyway.

Personally, I think that the advantages outweigh the disadvantages here. Most common use cases won't notice the change because coercion will not be impacted, and most existing code (IMHO) either uses explicit typing or expects a Python bytes object anyway.

So my preferred change would be to make byte string literals 'bytes' by default, except in nogil blocks.

Opinions?

Stefan
_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev