Florian Weimer, 28.01.2011 15:27:
> * Stefan Behnel:
>> The nice thing about Py_UNICODE is that it basically gives you native
>> Unicode code points directly, without needing to decode UTF-8 byte runs
>> and the like. In Cython, it allows you to do things like this:
>>
>>     def test_for_those_characters(unicode s):
>>         for c in s:
>>             # warning: randomly chosen Unicode escapes ahead
>>             if c in u"\u0356\u1012\u3359\u4567":
>>                 return True
>>         else:
>>             return False
>>
>> The loop runs in plain C, using the somewhat obvious implementation with
>> a loop over Py_UNICODE characters and a switch statement for the
>> comparison. This would look a *lot* more ugly with UTF-8 encoded byte
>> strings.
>
> Not really, because UTF-8 is quite search-friendly. (The if would have to
> invoke a memmem()-like primitive.) Random subscripts are problematic.
>
> However, why would one want to write loops like the above? Don't you have
> to take combining characters (comprising multiple codepoints) into account
> most of the time when you look at individual characters? Then UTF-32 does
> not offer much of a simplification.

Hmm, I think this discussion is pointless. Regardless of the memory layout, you can always go down to the byte level and use an efficient (multi-)substring search algorithm. (which is obviously helped if you know the layout at compile time *wink*)
Bad example, I guess.

Stefan
_______________________________________________
Python-Dev mailing list
http://mail.python.org/mailman/listinfo/python-dev
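
For illustration, here is a rough sketch of the two approaches discussed in this thread, in plain Python rather than Cython. The helper names are made up for this example, and the character set is the randomly chosen one from the quoted code. The byte-level variant assumes its input is valid UTF-8: because lead bytes and continuation bytes never overlap in UTF-8, a match for an encoded character can only start on a character boundary, so a plain substring search is safe.

```python
# Hypothetical helper names; the targets are the randomly chosen
# escapes from the quoted example, not meaningful characters.
TARGETS = "\u0356\u1012\u3359\u4567"

# Encode each target character once, up front.
TARGETS_UTF8 = [c.encode("utf-8") for c in TARGETS]


def contains_any_codepoint(s: str) -> bool:
    """Code point view: test each character of a decoded string."""
    return any(c in TARGETS for c in s)


def contains_any_utf8(data: bytes) -> bool:
    """Byte view: substring search on UTF-8 data without decoding it.

    Assumes `data` is valid UTF-8.
    """
    return any(needle in data for needle in TARGETS_UTF8)


text = "abc\u3359def"
assert contains_any_codepoint(text)
assert contains_any_utf8(text.encode("utf-8"))
assert not contains_any_utf8("caf\u00e9".encode("utf-8"))
```

Neither variant addresses the combining-character question raised above; both work on individual code points, not on user-perceived characters.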
