On Sat, Jan 29, 2011 at 2:35 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> Robert Bradshaw, 29.01.2011 10:01:
>> On Fri, Jan 28, 2011 at 11:37 PM, Stefan Behnel wrote:
>>> there is a recent discussion on python-dev about a new memory layout
>>> for the unicode type in CPython 3.3(?), proposed by Martin von Löwis
>>> (so it's serious ;)
>>>
>>> http://comments.gmane.org/gmane.comp.python.devel/120784
>>
>> That's an interesting PEP, I like it.
>
> Yep, after some discussion, I started liking it too. Even if it means
> I'll have to touch a lot of code in Cython again. ;)
>
>>> If nothing else, it gave me a new view on Py_UCS4 (basically a 32-bit
>>> unsigned int), which I had completely lost from sight. It's public
>>> and undocumented and has been there basically forever, but it's a
>>> much nicer type to support than Py_UNICODE, which changes size based
>>> on build-time options. Py_UCS4 is capable of representing any Unicode
>>> code point on any platform.
>>>
>>> So, I'm proposing to switch from the current Py_UNICODE support to
>>> Py_UCS4 internally (without breaking user code, which can continue
>>> to use either of the two explicitly). This means that loops over
>>> unicode objects will infer Py_UCS4 as the loop variable type, as
>>> would indexing. It would basically become the native C type that
>>> one-character unicode strings coerce to and from. Coercion from
>>> Py_UCS4 to Py_UNICODE would raise an exception if the value is too
>>> large for the given CPython runtime, as would write access to
>>> unicode objects (in case anyone really does that) outside of the
>>> platform-specific Py_UNICODE value range. Writing to unicode buffers
>>> will be dangerous and tricky anyway if the above PEP gets accepted.
>>
>> I am a bit concerned about the performance overhead of the Py_UCS4 to
>> Py_UNICODE coercion (e.g. if constructing a Py_UNICODE* by hand), but
>> maybe that's both uncommon and negligible.
>
> I think so. If users deal with Py_UNICODE explicitly, they'll likely
> type their respective variables anyway, so that there won't be an
> intermediate step through Py_UCS4. And on 32-bit Unicode builds this
> isn't an issue at all.
>
>>> One open question that I see is whether we should handle surrogate
>>> pairs automatically. They are basically a split of large Unicode
>>> code point values (>65535) into two code points in specific ranges
>>> that are safe to detect. So we could allow a 2-'character' surrogate
>>> pair in a unicode string to coerce to one Py_UCS4 character, and
>>> coerce that back into a surrogate pair string if the runtime uses
>>> 16 bits for Py_UNICODE. Note that this would only work for single
>>> characters, not for looping or indexing (without the PEP, that is).
>>> So it's somewhat inconsistent. It would work well for literals,
>>> though. Also, we'd have to support it for 'in' tests, as a Py_UCS4
>>> value may simply not be in a Py_UNICODE buffer, even though the
>>> character is in the string.
>>
>> No, I don't think we should handle surrogate pairs automatically, at
>> least without making it optional--this could be a significant
>> performance impact with little benefit for most users. Using these
>> higher characters is rare, but using them on a non-UCS4 build is
>> probably even rarer.
>
> Well, basically they are the only way to use 'wide' Unicode characters
> on 16-bit Unicode builds.
>
> I think a unicode string of length 2 should be able to coerce into a
> Py_UCS4 value at runtime instead of raising the current exception
> because it's too long.

Sure, that's fine by me.
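Roughly, I'd expect the narrow-build coercion to look something like
this (untested sketch; the helper name is made up, and the error
convention is just roughed in):

    #include <Python.h>

    /* Coerce a length-1 or length-2 unicode object to Py_UCS4 on a
     * narrow build, combining a surrogate pair into one code point. */
    static Py_UCS4 __pyx_unicode_to_ucs4(PyObject* u) {
        Py_ssize_t len = PyUnicode_GET_SIZE(u);
        Py_UNICODE* s = PyUnicode_AS_UNICODE(u);
        if (len == 1)
            return (Py_UCS4) s[0];
        if (len == 2 &&
                s[0] >= 0xD800 && s[0] <= 0xDBFF &&  /* high surrogate */
                s[1] >= 0xDC00 && s[1] <= 0xDFFF) {  /* low surrogate */
            return 0x10000
                + (((Py_UCS4) s[0] - 0xD800) << 10)
                + ((Py_UCS4) s[1] - 0xDC00);
        }
        PyErr_SetString(PyExc_ValueError,
            "only length-1 unicode strings (or surrogate pairs) "
            "can be coerced to Py_UCS4");
        return (Py_UCS4) -1;  /* error value; caller checks PyErr_Occurred() */
    }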
> For the opposite direction, integer to unicode string, you already get
> a string of length 2 on narrow builds, that's how unichr()/chr() work
> in Python 2/3. So, in a way, it's actually more consistent with how
> narrow builds work today.

OK.

> The only reason this isn't currently working in Cython is that
> Py_UNICODE is too small on narrow builds to represent the larger
> Unicode code points. If we switched to Py_UCS4, the problem would go
> away on narrow builds now, and code could be written today that would
> easily continue to work efficiently in a post-PEP CPython, as it
> wouldn't rely on the deprecated (and then inefficient) Py_UNICODE
> type anymore.
>
> What about supporting surrogate pairs in 'in' tests only on narrow
> platforms? I mean, we could simply duplicate the search code for that,
> depending on how large the code point value really is at runtime. That
> code will become a lot more involved anyway when the PEP gets
> implemented.

Sure. This shouldn't have non-negligible performance overhead for the
simple case, and it would be consistent with coercing to a 2-character
unicode string as above and then applying the Python 'in' operator.
(See the sketch in the P.S. below.)

>> Also, this would be inconsistent with python-level slicing, indexing,
>> and range, right?
>
> Yes, it does not match well with slicing and indexing. That's the
> problem with narrow builds in both CPython and Cython. Only the PEP
> can fix that by basically dropping the restrictions of a narrow build.

Let's let indexing do what indexing does.

- Robert
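P.S. For the narrow-build 'in' test, something like this is what I have
in mind (again an untested sketch with a made-up helper name):

    #include <Python.h>

    /* 'c in u' for a Py_UCS4 needle on a narrow (16-bit Py_UNICODE)
     * build: search for the matching surrogate pair when the code
     * point doesn't fit into a single Py_UNICODE unit. */
    static int __pyx_ucs4_in_unicode(Py_UCS4 c, PyObject* u) {
        Py_UNICODE* s = PyUnicode_AS_UNICODE(u);
        Py_ssize_t i, n = PyUnicode_GET_SIZE(u);
        if (c <= 0xFFFF) {
            /* simple case: the value fits into one code unit */
            for (i = 0; i < n; i++)
                if (s[i] == (Py_UNICODE) c)
                    return 1;
        } else {
            /* split the code point into a high/low surrogate pair */
            Py_UNICODE hi = (Py_UNICODE) (0xD800 + ((c - 0x10000) >> 10));
            Py_UNICODE lo = (Py_UNICODE) (0xDC00 + ((c - 0x10000) & 0x3FF));
            for (i = 0; i + 1 < n; i++)
                if (s[i] == hi && s[i+1] == lo)
                    return 1;
        }
        return 0;
    }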