Hi,

I've been working on a fix for ticket #602, negative indexing for inferred 
char*.

http://trac.cython.org/cython_trac/ticket/602

Currently, when you write this:

     s = b'abc'

s is inferred as char*. This has several drawbacks. For one, we lose the 
length information, so "len(s)" becomes O(n) instead of O(1). Negative 
indexing is broken because it compiles to plain pointer arithmetic and thus 
reads outside the allocated memory area of the string. Also, code like the 
following is extremely inefficient because it requires multiple conversions 
from a char* of unknown length to a Python bytes object:

     # s1 and s2 are assumed to be existing Python bytes objects
     s = b'abc'
     a = s1 + s
     b = s2 + s
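To make the first two drawbacks concrete, here is a rough plain-Python model of what char* semantics amount to. This is only a sketch: the "memory" bytearray, its contents, and the helpers c_strlen() and c_index() are made up for illustration and are not what Cython generates.

```python
# Hypothetical model of C memory: two NUL-terminated strings side by side.
# The layout is an assumption for illustration only.
memory = bytearray(b"XYZ\x00abc\x00")
s_ptr = 4  # "pointer" to the 'a' of "abc"


def c_strlen(mem, ptr):
    """Length of a NUL-terminated string: scans byte by byte, so O(n)."""
    n = 0
    while mem[ptr + n] != 0:
        n += 1
    return n


def c_index(mem, ptr, i):
    """Pointer arithmetic: no bounds check, no wrap-around for negative i."""
    return mem[ptr + i]


# The same data as a Python bytes object, which carries its own length.
py_s = b"abc"

length_via_scan = c_strlen(memory, s_ptr)   # has to walk the whole string
length_stored = len(py_s)                   # O(1), length is stored

last_via_pointer = c_index(memory, s_ptr, -1)  # reads the byte *before* "abc"
last_via_bytes = py_s[-1:]                     # the actual last byte, b'c'
```

In this model, index -1 on the pointer lands on the terminator of whatever happens to precede the string, whereas the bytes object gives the last byte as expected.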

I came to the conclusion that the right fix is to stop letting byte string 
literals start off as char*. This immediately fixes these issues and 
improves Python compatibility while still allowing automatic coercion, but 
it also comes with its own drawbacks.

In nogil blocks, you will have to explicitly declare a variable as char* 
when assigning a byte string literal to it, otherwise you'd get a compile 
time error for a Python object assignment. I think this is a minor issue as 
most users would declare their variables anyway when using nogil blocks. 
Given that there isn't much you can do with a Python string inside of a 
nogil block, we could also honour nogil blocks during type inference and 
automatically infer char* for literals here. I don't think it would hurt 
anyone to do that.

The second drawback is that it impacts type inference for char loops. 
Previously, you could write

     s = b'abc'
     for c in s:
         print c

and Cython would infer 'char' for c and print integer byte values. When s 
is inferred as 'bytes', c will be inferred as 'Python object' because 
Python 2 returns 1-byte strings and Python 3 returns integers on iteration. 
Thus the loop will run entirely in Python code and print different things 
in Py2 and Py3.
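For reference, the Py2/Py3 difference and two ways of sidestepping it look like this in plain Python. This is just a sketch of the Python-level behaviour, not of what Cython would generate, and the variable names are illustrative.

```python
s = b'abc'

# Plain iteration: 1-byte strings in Python 2, integers in Python 3.
items = list(s)

# Iterating over a bytearray yields integers in both Python 2 and Python 3.
as_ints = list(bytearray(s))

# Slicing always yields byte strings of length 1 in both versions.
as_bytes = [s[i:i + 1] for i in range(len(s))]
```

Code that needs the old integer behaviour regardless of the Python version could rely on one of the latter two forms, or simply declare the variable as char* explicitly.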

I do not expect that this is a major issue either. Iteration over literals 
should be rare, after all, and if the byte string is constructed in any 
way, the type either becomes a bytes object through Python operations (like 
concatenation) or is explicitly provided, e.g. as a return type of a 
function call. But it is a clear behavioural change for the type inference 
in an area where Cython's (and Python's) semantics are tricky anyway.

Personally, I think that the advantages outweigh the disadvantages here. 
Most common use cases won't notice the change because coercion will not be 
impacted, and most existing code (IMHO) either uses explicit typing or 
expects a Python bytes object anyway. So my preferred change would be to 
make byte string literals 'bytes' by default, except in nogil blocks.

Opinions?

Stefan

_______________________________________________
Cython-dev mailing list
Cython-dev@codespeak.net
http://codespeak.net/mailman/listinfo/cython-dev
